The Facebook BGP Outage (2021)

On October 4, 2021, Facebook, Instagram, WhatsApp, and Messenger disappeared from the internet for roughly six hours, one of the most visible outages to hit a company of that scale. The services were not merely slow; they were unreachable, as if the entire company had been unplugged from the network. Billions of users were affected, and the outage briefly broke unrelated services that depended on Facebook login.

Meta’s own engineering blog post-mortem explains how it happened. During routine maintenance on the global backbone that connects Meta’s data centers, an engineer issued a command meant to assess available backbone capacity, but the command instead disconnected all of Meta’s backbone connections, isolating the data centers from one another. An audit tool that should have caught and blocked the erroneous command had a bug and failed to stop it.
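The kind of guard the audit tool was meant to provide can be illustrated with a minimal sketch. Meta's post-mortem does not describe how its tool works, so everything here is an assumption made for illustration: the function name audit_drain, the link names, and the rule that a command must leave at least one backbone link up.

```python
def audit_drain(links_up: set[str], links_to_drain: set[str], min_remaining: int = 1) -> bool:
    """Return True if the command may run, False if it must be blocked.

    Hypothetical rule: a maintenance command may not take the number of
    working backbone links below `min_remaining`.
    """
    remaining = links_up - links_to_drain
    return len(remaining) >= min_remaining


backbone = {"link-east", "link-west", "link-north"}  # made-up link names

# Draining one link for maintenance leaves two in service: allowed.
print(audit_drain(backbone, {"link-east"}))

# A command that would drain every link should be rejected before it runs.
print(audit_drain(backbone, set(backbone)))
```

A check like this only helps if it is itself correct. In the outage, the equivalent safeguard existed but a bug let the command through.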

The fatal twist involved DNS and BGP. Meta’s DNS servers are designed to withdraw their Border Gateway Protocol (BGP) route advertisements if they lose their own connection to the data centers, a safety mechanism intended to steer traffic away from an unhealthy location. When the backbone vanished, every DNS location concluded it was unhealthy at the same time and withdrew its BGP routes from the rest of the internet. With those routes gone, resolvers elsewhere had no path to Meta’s name servers, so the world could no longer resolve facebook.com, and the still-running servers became invisible.
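The health-check behavior can be modeled in a few lines to show why a mechanism that is sensible for one site becomes dangerous when every site fails together. This is a toy simulation, not Meta's implementation: the DnsSite class, the site names, and the single boolean health signal are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    """Hypothetical model of one DNS location announcing a shared prefix."""
    name: str
    backbone_reachable: bool = True
    advertising: bool = True

    def run_health_check(self) -> None:
        # Announce the route only while the data centers are reachable.
        self.advertising = self.backbone_reachable


def sites_still_announcing(sites: list[DnsSite]) -> list[str]:
    """Names of the sites the rest of the internet can still route to."""
    return [site.name for site in sites if site.advertising]


def simulate(backbone_state: dict[str, bool]) -> list[str]:
    """Apply a backbone state per site, run health checks, report survivors."""
    sites = [DnsSite(name, reachable) for name, reachable in backbone_state.items()]
    for site in sites:
        site.run_health_check()
    return sites_still_announcing(sites)


# One site cut off: the others keep answering, as the mechanism intends.
print(simulate({"site-a": False, "site-b": True, "site-c": True}))

# Every site cut off at once: no route to the name servers remains.
print(simulate({"site-a": False, "site-b": False, "site-c": False}))
```

The first call leaves two sites announcing; the second leaves none. Each site makes a locally correct decision, and the sum of those decisions removes the service from the internet.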

The outage then turned on its operators. Because so many of Meta’s internal systems and tools relied on the same DNS and network that had just disappeared, engineers lost the remote means they would normally use to diagnose and fix the problem. Recovery required sending people physically to the data centers, where hardened security and access controls, designed to keep attackers out, also slowed the engineers trying to get in and restart systems by hand.
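The shared-dependency problem can be made concrete by treating tools and infrastructure as a dependency graph and asking which tools transitively rely on the failed component. The graph below is invented for illustration; the tool names and their dependencies are assumptions, not a description of Meta's systems.

```python
def depends_on(graph: dict[str, set[str]], node: str, target: str) -> bool:
    """Return True if `node` is `target` or transitively depends on it."""
    seen: set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, ()))
    return False


# Hypothetical tools mapped to the things they need in order to work.
deps = {
    "remote-shell": {"internal-dns"},
    "monitoring-dashboard": {"internal-dns"},
    "internal-dns": {"backbone"},
    "on-site-console": set(),
}

# Everything that stops working once the backbone is gone.
unusable = sorted(tool for tool in deps if depends_on(deps, tool, "backbone"))
print(unusable)
```

In this toy graph only the on-site console survives, which mirrors why recovery required people to go to the data centers in person.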

Bringing everything back was not as simple as flipping a switch. Meta had to be careful that a sudden surge of traffic and the power and cache demands of restarting services did not cause a second crash, so the restoration was paced deliberately. The company’s analysis frames the event as a cascade in which a single command, an undetected audit-tool bug, and an automated safety response combined into a global failure.
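A paced restoration can be sketched as a staged ramp that admits a growing fraction of normal traffic and advances only while the system stays healthy. The post-mortem does not publish the actual procedure, so the function names, the doubling factor, and the health callback are all assumptions made for illustration.

```python
from typing import Callable


def ramp_schedule(start: float, factor: float) -> list[float]:
    """Fractions of normal traffic to admit at each stage, ending at 1.0."""
    if not 0 < start <= 1 or factor <= 1:
        raise ValueError("need 0 < start <= 1 and factor > 1")
    fractions = []
    current = start
    while current < 1.0:
        fractions.append(current)
        current *= factor
    fractions.append(1.0)
    return fractions


def restore(schedule: list[float], healthy_at: Callable[[float], bool]) -> float:
    """Walk the schedule and return the last fraction that stayed healthy."""
    admitted = 0.0
    for fraction in schedule:
        if not healthy_at(fraction):
            break  # hold at the previous stage instead of pushing on
        admitted = fraction
    return admitted


schedule = ramp_schedule(start=0.125, factor=2)
print(schedule)

# Hypothetical health signal: caches are only warm enough for half the load.
print(restore(schedule, healthy_at=lambda fraction: fraction <= 0.5))
```

Stopping at the last healthy stage, rather than jumping straight to full traffic, is what keeps a cold restart from turning into a second outage.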

The outage became a widely studied example of how DNS and BGP, the two systems that make names and routes work on the internet, can together become a single point of fragility. It also shows how the very tools meant to keep a network safe can amplify a failure when they all depend on the thing that just broke.
