Post Mortem



Grant Slater (February 15, 2025) "Post Mortem : Network Outage Affecting OpenStreetMap.org - 15 December 2024" OpenStreetMap Foundation - Operations Working Group

15 December 2024, 03:53 UTC
HE.net suffers a routing equipment failure in Amsterdam, causing OpenStreetMap.org to become unreachable. Both of OpenStreetMap’s redundant, diverse fibre links to HE.net went offline due to the failure of HE.net’s equipment.

*** 

HE.net
Hurricane Electric Internet Services
Internet Backbone and Colocation Provider

HE operates its own global IPv4 and IPv6 network and is considered the largest IPv6 backbone in the world as measured by number of networks connected. Within its global network, HE is connected to over 320 major exchange points and exchanges traffic directly with more than 9,600 different networks. 

Employing a fiber-optic topology, HE has no less than five redundant 100G paths crossing North America, five separate 100G paths between the US and Europe, and 100G rings in Europe and Asia. 

HE was founded in the garage of President Mike Leber in 1994.

***

Rob Powell (June 29, 2015) "Industry Spotlight: Hurricane Electric CEO Mike Leber" Telecom Ramblings

"I had been working in software engineering for about 15 years and worked at a number of high-tech companies setting up local area networks. I knew about the internet and ARPAnet, and in 1989, I tried to buy a connection. However, it wasn't commercialized completely, it was academic and research-lab oriented, and they wouldn't sell it to me."

"In 1993, I became familiar with HTTP and the Mosaic browser with images, and I knew that was going to be huge."

"In 1994, I was actually able to buy a connection at last. Instead of consulting and helping people set up networks, I realized I could operate a network and servers as a business."

"In 1995-1996, we started running BGP and connected to MAE-West and PAIX Palo Alto, and then we expanded to MAE-West in Virginia and to the AADS exchange in Chicago with Ameritech. "

***

BGP (Border Gateway Protocol) : The routing protocol used to exchange traffic between different networks (Autonomous Systems).

MAE-West (Metropolitan Area Exchange – West)  : A major Internet Exchange Point (IXP) located in San Jose, California

PAIX (Palo Alto Internet Exchange) : One of the earliest commercial IXPs, located in Palo Alto, California

AADS (Ameritech Advanced Data Services) Exchange : A major IXP in Chicago

***

"Basically, we expanded the network organically, city by city, as each case could be made to get to a place with a large density of networks. We would look at places where there were a lot of metro fiber providers and regional carriers in a particular building and go there and make it so they can bring customers there to our core routers to get IP transit. And that's the repeat story all around the world for our strategy. We end up facilitating a lot of local loop revenue and circuit revenue for carriers in the region, as they sell last mile or sometimes longer circuits to get to us."

"We just brought up Rome recently (2015), and also Osaka earlier in the year. We're in the process of bringing up Seoul right now."

"It's not about cities. It's about very carefully analyzing the market and noting there's a lot of stranded fiber assets out there. It does nobody any good in the industry when you have different fiber providers whose nodes are in separate buildings in a city and there are no overlapping, common meet points. It's much better when there are a couple of carrier-neutral data center where the people --- who spent the gigantic amount of capital it takes to put metro or longhaul fiber in the ground -- can overlap some. And that creates an opportunity for Hurricane Electric to sell IP transit and opportunity for those facilities-based carriers to realize more of the potential of their cable plan."

"We've been through two big downturns in this industry, with the dot-come bomb and the financial crisis of 2008-2009. We got through them fine."

"Hurricane is closely held and I'm not interested in taking equity money from somebody who expects an exit strategy or liquidity event. "

"What type of assets would you be most interested in looking at?"

"IP networks probably."

"A lot of the people who have data centers primarily have equipment with a short term lease in someone else's property. For Hurricane, that's not that strong because the strength of that business is determined by what rate they end up renegotiating in 5 or 10 years."

"On the data center side, we'd want to acquire the building. It doesn't need to be wholly owned, but I need to have enough of an interest in the building such that I receive some of the benefit from making it a decent place to do business and build critical mass."

"What factors made you decide to go all-in with IPv6 so early in the process?"

"We deployed IPv6 early as a differentiating strategy because in telecom, there aren't too many differentiators. IPv6 is a new technology, the cost was relatively modest, and it actually serves an important technical purpose." 

"We applied the aggressive strategy for expansion, but almost nobody else showed up at the game with a major network and we ended getting about 20 free laps around the track."

"IP transit is a very competitive market. Have you ever considered expanding into other layers of network with other products like CDN or hosting?"

"The CDN business has a different target customer profile and a different method by which you sell than IP transit or colocation. The top 20 customers of the CDN business are the ones that make you or break you. That's very different than Hurricane's business, which targets the 50,000 or so networks in the world and a couple hundred thousand internet companies. "

"Hurricane is very much an engineering company. Are we good at bolting servers into racks? Yes, but the missing expertise for CDN is the sales side of it."

"There is a tug of war on the internet between content and eyeballs. There are tensions from a business plans perspective and a pulic policy thing when people act to restrict access. These have gotten blended together in a way that deciphering it is extremely complicated. Different companies have different personalities, and the vast majority work their relationships fine in the interest of the public and the industry. But there are always a few companies that like to act out on the public."

***

15 December 2024, 04:01 UTC
StatusCake monitoring tools detect the outage and send SMS alerts. Incident is immediately reported to the Operations team.
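
An external uptime probe of the kind described above can be sketched in a few lines of Python. This is only an illustration of the idea, not StatusCake's implementation: the function name `check_site`, the 10-second timeout, and the injectable `opener` parameter are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=10.0, opener=urllib.request.urlopen):
    """Probe `url` once and return (is_up, detail).

    `opener` is a hypothetical injection point so the check can be
    exercised without real network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx / 5xx).
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, DNS failures, refused connections and timeouts,
        # which is what an upstream routing failure looks like from outside.
        return False, f"unreachable: {exc}"
    return (200 <= status < 400), f"HTTP {status}"
```

A scheduler would call something like `check_site("https://www.openstreetmap.org/")` from several vantage points and raise an alert only after repeated failures, so that a single lost packet does not page anyone.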

15 December 2024, 04:03 UTC
Operations team (via IRC) confirms that OpenStreetMap is offline.

15 December 2024, 04:06 UTC
Grant confirms that this is an external ISP issue with HE.net.

15 December 2024, 04:18 UTC
OSM Operations team emails HE.net's Network Operations Centre (NOC), inquiring if there is any unplanned maintenance.

15 December 2024, 04:24 UTC
HE.net confirms via email that there is no scheduled maintenance. They are investigating the outage.

15 December 2024, 04:24 UTC
OSM Operations contacts HE.net by telephone. 
HE.net states the outage is in Amsterdam, with no estimated time to recover. They are waiting for Equinix "remote hands" to investigate.

***

15 December 2024, 11:31 UTC
A read-only backup instance of OpenStreetMap -- hosted in Dublin -- is activated.

A maintenance notice is displayed on OpenStreetMap.org, indicating that data edits cannot be saved.

OpenStreetMap's OAuth service requires database writes to store access tokens. So, while the website was read-only, users were unable to log in to OpenStreetMap.org, community.openstreetmap.org, uMap and other services that relied on the OAuth service.

Other Amsterdam-based services, such as the dev server, Taginfo (taginfo.openstreetmap.org), and certain aerial imagery services were unavailable. The tile rendering service and Nominatim geocoding remained online but operated at reduced capacity.

16 December 2024, 10:11 UTC
The procedures required to fully fail over to Dublin (Postgres & osmdbt) are documented.

osmdbt
Tools for creating replication feeds from the main OSM database.
You need a C++17 compliant compiler (GCC 8 / Clang 7 are known to work).
You also need the following libraries and tools: Libosmium, Protozero, boost-filesystem, boost-program-options, bz2lib, zlib, Expat, cmake, yaml-cpp, libpqxx, Pandoc, and gettext, plus the PostgreSQL packages for your distribution (Debian/Ubuntu: postgresql-common and postgresql-server-dev-all; Fedora/CentOS: postgresql-server and postgresql-server-devel).

16 December 2024, 11:55 UTC
The OSM Operations team decides to wait for Amsterdam connectivity to be restored rather than perform an Emergency Full Failover, citing incomplete replication, operational complexity and perceived risk.
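
One way to put a number on "incomplete replication" is to compare write-ahead log positions (LSNs) on the primary and the standby. The sketch below assumes LSNs in PostgreSQL's text form, two hexadecimal numbers separated by a slash, and uses made-up values. It is not the procedure the Operations team followed, and the function names are hypothetical.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte offset.

    The part before the slash is the high 32 bits of the WAL position,
    the part after it is the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to receive (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))


# Illustrative values only, not taken from the incident.
lag = replication_lag_bytes("16/B374D848", "16/B374D000")
print(f"replica is {lag} bytes behind")
```

PostgreSQL can do the same subtraction server-side with `pg_wal_lsn_diff`. The catch during an outage like this one is that the primary is unreachable, so its last position is unknown from the standby site, which is part of what makes promoting the standby risky.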

The ETA for full restoration of services is communicated to the OSM Communication Working Group, the OSMF Board and the wider OSM community (Announcement & Talk mailing lists, Mastodon, community forum).

The Equinix team kindly escalates the Equinix Internet connectivity installations in Amsterdam and Dublin, including the fibre light and routing tests.

17 December 2024, 12:21 UTC
A new Equinix Internet link becomes operational in Amsterdam, bypassing HE.net. 

Full functionality for OpenStreetMap.org and its API is restored through this alternative route, prior to HE.net fixing their equipment.

18 December 2024, 00:29 UTC
HE.net confirms their own service is restored in Amsterdam.

Routing via HE.net resumes normal operation.

All network paths for OpenStreetMap.org are fully available again.
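
The timeline above translates into concrete outage durations. The short calculation below uses only the timestamps listed in this post and assumes the initial 03:53 failure time is in UTC like the others. The event labels are chosen for this example.

```python
from datetime import datetime, timezone


def utc(day, hour, minute):
    """Build a UTC timestamp in December 2024."""
    return datetime(2024, 12, day, hour, minute, tzinfo=timezone.utc)


EVENTS = {
    "outage_start": utc(15, 3, 53),      # HE.net equipment fails
    "read_only": utc(15, 11, 31),        # Dublin read-only instance activated
    "full_service": utc(17, 12, 21),     # Equinix link operational in Amsterdam
    "he_net_restored": utc(18, 0, 29),   # HE.net confirms restoration
}


def elapsed(start, end):
    """Format the time between two events as hours and minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


for label in ("read_only", "full_service", "he_net_restored"):
    print(label, elapsed(EVENTS["outage_start"], EVENTS[label]))
# read_only 7h 38m
# full_service 56h 28m
# he_net_restored 68h 36m
```

In other words, the site was completely unreachable for about seven and a half hours, editing was unavailable for roughly two and a third days, and the new Equinix link restored service about twelve hours before HE.net's own repair.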

***

Root Cause Analysis

The failure of HE.net routing equipment in Amsterdam was a single point of failure for the Amsterdam servers' internet connectivity.

Long Term Mitigation

Transition to a multi-ISP architecture so that, if one provider fails, another can take over with minimal interruption.

Research potential backup ISPs for the Amsterdam servers, comparing bandwidth, reliability, and costs. A follow-up will be held at the Karlsruhe Hack Weekend in February 2025 to discuss practical options, including BGP and retaining HE.net as a fallback ISP.
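
The core idea of a multi-ISP setup is simply "use the most preferred uplink that is still healthy". The toy model below illustrates that selection rule. The uplink names and priorities are hypothetical, and in a real deployment the decision would be made by the routers (for example through BGP) rather than by a script like this.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Uplink:
    name: str
    priority: int   # lower number = more preferred
    healthy: bool   # result of some external reachability check


def select_uplink(uplinks: Iterable[Uplink]) -> Optional[Uplink]:
    """Return the most preferred healthy uplink, or None if all are down."""
    candidates = [u for u in uplinks if u.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda u: u.priority)


# Hypothetical state: the preferred provider is down, two others are up.
uplinks = [
    Uplink("primary-isp", priority=1, healthy=False),
    Uplink("fallback-isp", priority=2, healthy=True),
    Uplink("4g-modem", priority=9, healthy=True),
]
chosen = select_uplink(uplinks)
print(chosen.name if chosen else "no uplink available")
```

With a single provider, as during this incident, the candidate list is empty the moment that provider fails, which is exactly the single point of failure identified in the root cause analysis.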

The Raspberry Pi 4 and 4G modem used for out-of-band access should be pre-configured so that they can serve as a manual fallback uplink in the event of an ISP outage. The link would be used for server access and for syncing the Postgres replication data to the secondary site.

The OpenStreetMap.org "rails-port" maintainers should investigate if it is possible and practical to de-couple the OAuth service to allow authentication to continue to function for third parties during times when the site is disrupted.

***

Thank you, Equinix Team, for your help getting the new connection installed quickly.

Thank you, Operations Team (Tom, Paul, Guillaume and others), for helping recover from the outage.
