Facebook has an explanation for its massive Monday outage
In a statement, Facebook said that a maintenance error accidentally disconnected its servers from the rest of the internet.
Yesterday, Facebook users everywhere experienced an unexpected and prolonged service blackout that affected access to all of its apps, including WhatsApp, Instagram, and Messenger. In the time since, Facebook has published two blog posts explaining what happened.
Late Monday evening, the company published the first blog post explaining what caused the dramatic problem. Santosh Janardhan, VP of infrastructure at Facebook, wrote that “the root cause of this outage was a faulty configuration change,” elaborating that “configuration changes on the backbone routers that coordinate network traffic between Facebook’s data centers” were where the issues occurred.
That network traffic disruption not only halted services on Facebook-owned apps such as WhatsApp but also “impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem,” Janardhan adds.
Facebook published another, more detailed blog post late this afternoon explaining exactly what went wrong. In it, Janardhan writes that “the backbone” he previously referenced “is the network Facebook has built to connect all our computing facilities together,” and that this network links all of Facebook’s data centers across the world through physical wires and cables. These data centers are responsible for storing data, keeping the platform running, and connecting Facebook’s network to the rest of the internet.
“The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself,” Janardhan explained.
Yesterday, however, during a routine maintenance job, “a command was issued with the intention to assess the availability of global backbone capacity,” and it “unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally” from each other, severing their connection to the internet. To make matters worse, a bug kept the audit tool that usually prevents mistakes like this from catching the problem.
A related issue involves two other pieces of internet architecture: the Domain Name System (DNS) servers and the Border Gateway Protocol (BGP), which advertises the Facebook DNS to the rest of the internet.
“The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” Janardhan wrote. “The total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.”
All of that sounds pretty technical, so in layman’s terms, here’s what to know about DNS, BGP, and what took place at Facebook.
Let’s talk about DNS and BGP
Let’s start with the Domain Name System (DNS) servers and Border Gateway Protocol (BGP). So what exactly are they?
The DNS is often referred to as the address book, or phonebook, of the internet. “What it does is when I have a domain name, which is designed to be human-readable—something like Google.com or Facebook.com—it turns that into an IP address, which is some string of numbers,” Justine Sherry, an assistant professor at Carnegie Mellon University, tells Popular Science. “And that’s much like your street address. So it’s like 5000 Forbes Ave versus Carnegie Mellon University.”
This phonebook feature, which Sherry noticed was missing yesterday when she tried to log into Facebook, is important because it’s the service that takes the human-readable domain name (facebook.com) you type into your browser’s address bar, and then tells the internet how to steer you to the server you want to talk to. After all, it’s easier for people to type the letters facebook.com into a web browser than to remember and enter numbers.
“Importantly, that phonebook is distributed, so Facebook kinda owns a slice of that phonebook saying ‘we are Facebook.com, and these are our addresses,’” Sherry explains. “When I typed in the URL I got an error that said NXDOMAIN, and that was the DNS telling me, ‘I don’t know what that domain name is, it doesn’t point to any address for me.’”
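To make the phonebook idea concrete, here is a minimal Python sketch of a distributed lookup. It is a toy model, not how real DNS software is built: the domain names, the addresses (taken from ranges reserved for documentation), and the `resolve` and `find_zone` helpers are all assumptions made up for illustration. Each organization owns one slice of the phonebook, and a name that no slice can answer produces an error loosely modeled on the NXDOMAIN response Sherry mentions.

```python
# Toy model of a distributed "phonebook". All names and addresses are
# illustrative placeholders, not real DNS data.

class NXDomain(Exception):
    """Raised when no slice of the phonebook has an answer for a name."""


# Each organization owns the slice of the phonebook for its own domain.
# The addresses come from documentation-only ranges.
ZONES = {
    "example-social.test": {"www.example-social.test": "192.0.2.10"},
    "example-search.test": {"www.example-search.test": "198.51.100.20"},
}


def find_zone(name, zones):
    """Return the slice whose domain matches the end of the requested name."""
    for domain, records in zones.items():
        if name == domain or name.endswith("." + domain):
            return records
    return None


def resolve(name, zones=ZONES):
    """Translate a human-readable name into an address, or raise NXDomain."""
    records = find_zone(name, zones)
    if records is None or name not in records:
        raise NXDomain(name)
    return records[name]


# Normal day: the name maps to an address.
print(resolve("www.example-social.test"))

# Outage (simplified): one organization's slice can no longer be consulted.
zones_during_outage = {
    domain: records
    for domain, records in ZONES.items()
    if domain != "example-social.test"
}
try:
    resolve("www.example-social.test", zones_during_outage)
except NXDomain as missing:
    print("no address found for", missing)
```

The sketch simply drops one slice to imitate the outage. Real resolvers distinguish between a name that does not exist and a server that cannot be reached, which this toy version does not attempt to model.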
Then there’s the service called BGP, which stands for Border Gateway Protocol. “You can think of BGP as the Google Maps of the internet. That’s the thing that tells you if I have an address, how do I get there,” says Sherry. “It’s designed to allow different networks from different organizations like Facebook, Google, Comcast, Sprint, and AT&T to all share what routes they have.”
Ethan Katz-Bassett, an associate professor at Columbia University, says in an email that the Border Gateway Protocol (so called because it runs at the borders between networks like Facebook and Google) sets up the route that requests follow to reach the Facebook DNS servers.
The misconfiguration resulted in Facebook’s BGP routers no longer advertising a route to the Facebook DNS servers. Therefore, the requests would “get lost” at the edge of the sender’s network. “The [Facebook] system was designed such that, if a router can’t talk to a data center, it withdraws the DNS route,” Katz-Bassett writes. “This might be an ok behavior when a single router is having a problem, but it disconnects everything when they all are having problems at once.”
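The design Katz-Bassett describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not real BGP and not Facebook’s actual system: the `EdgeRouter` class, the router names, the number of routers, and the address prefix (a documentation-only range) are all hypothetical. The only rule it models is the one in the quote: a router that cannot talk to a data center stops advertising the route to the DNS servers.

```python
# Toy model of the withdrawal rule described above. Everything here is
# hypothetical and greatly simplified; it does not speak real BGP.

DNS_PREFIX = "192.0.2.0/24"  # placeholder range standing in for the DNS servers


class EdgeRouter:
    def __init__(self, name):
        self.name = name
        self.can_reach_data_center = True

    def advertised_prefixes(self):
        # A router that cannot talk to a data center withdraws the DNS route.
        return {DNS_PREFIX} if self.can_reach_data_center else set()


def is_reachable(prefix, routers):
    """Outsiders can reach a prefix if at least one router still advertises it."""
    return any(prefix in router.advertised_prefixes() for router in routers)


routers = [EdgeRouter(f"edge-{i}") for i in range(4)]

# One router has a problem: the others still advertise the route.
routers[0].can_reach_data_center = False
single_failure_ok = is_reachable(DNS_PREFIX, routers)

# The whole backbone goes down: every router withdraws the route at once.
for router in routers:
    router.can_reach_data_center = False
total_failure_ok = is_reachable(DNS_PREFIX, routers)

print(single_failure_ok, total_failure_ok)  # prints: True False
```

With a single failure the safety rule does its job, steering traffic away from the broken router. When every router loses contact simultaneously, the same rule removes the last remaining route.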
Sherry compares BGP to the interstate highway system: “This is the thing that glues together the different states’ highways. Facebook withdrew a bunch of their routes and started saying that they didn’t have routes to get to their phonebook.”
So why did engineers have to go down to the California data center?
During yesterday’s outage, the internet’s mapping system had essentially erased all of the routes to Facebook, which meant that not only everyday users but also the company’s own employees couldn’t reach it (at least, not remotely).
Sherry speculates that Facebook probably networked all the digital badge cards to an internal database hosted on their own servers and DNS that would keep track of who has access to the building. And when their DNS and servers went down, the card key system also stopped functioning.
Normally, when engineers work with servers, they don’t have to be physically near them. They can log in remotely to access and interact with the machines, and work on them over the internet. However, in this case, they couldn’t access them remotely, so the only way to get access was to go in physically and plug a monitor into those servers.
Facebook said that they sent engineers onsite to the physical data centers in order to debug and restart the systems. “This took time, because these facilities are designed with high levels of physical and system security in mind,” Janardhan stated in the blog. “They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.” To prevent abrupt power surges or crashes, they turned services back on in a piecemeal fashion.
Many computer scientists observed overloads and backups in the internet infrastructure yesterday prior to Facebook coming back online. Cloudflare reported that they received 30 times more queries than usual through their DNS services. That’s because when your web browser tries to load Facebook or Instagram and it can’t find it, it tries again. “People were constantly querying the phonebook over and over again going ‘where’s Facebook? Where’s Facebook? Where’s Facebook?’” says Sherry.
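A back-of-the-envelope Python sketch shows how retries multiply traffic. The client count and the retry limit below are arbitrary assumptions chosen for illustration; they are not measurements and are not meant to reproduce the figure Cloudflare reported. The `total_queries` function is likewise a hypothetical helper, not part of any real resolver.

```python
def total_queries(clients, max_attempts, lookup_succeeds):
    """Count lookups sent when each client retries until success or gives up."""
    queries = 0
    for _ in range(clients):
        for _attempt in range(max_attempts):
            queries += 1
            if lookup_succeeds:
                break  # a successful lookup needs no retry
    return queries


# Assumed numbers: 1,000 clients, each willing to try up to 5 times.
normal = total_queries(clients=1000, max_attempts=5, lookup_succeeds=True)
outage = total_queries(clients=1000, max_attempts=5, lookup_succeeds=False)

print(normal, outage, outage / normal)  # prints: 1000 5000 5.0
```

In this toy model the load grows in direct proportion to the number of retries. In practice, apps and people who keep reloading add further attempts on top of whatever the software does automatically.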
The outlook for Facebook and the questions that remain
A decade ago, an issue like this wouldn’t have been quite as widespread. WhatsApp, Instagram, and Facebook were all independent companies running on independent infrastructure. “And now, all of those are the same product,” Sherry says. “What we saw yesterday were companies that were unable to operate.” A lot of local businesses with a Facebook page and an Instagram page that they use to reach customers couldn’t anymore. Millions of users flocked to alternative messaging apps like Signal, Telegram, and even Twitter, Bloomberg reported.
“We see these failures a couple of times a year now, where large swaths of the internet are going down. Sometimes it’s BGP, sometimes it’s the DNS, sometimes it’s some esoteric storage system that Amazon uses internally,” says Sherry. But now, “every company, every business, every organization relies on just a handful of companies, a handful of technical products, and when these things fail, they have huge cascading effects across the internet and across different industries.”
For her, the biggest impact was with WhatsApp, a service that she uses to contact family. “There are many places in the world where WhatsApp is cell phone service,” she says.
Still, in general, it’s unlikely for many separate services to go down at once, Sherry says. For example, it would be rare for both Facebook and Twitter to simultaneously crash, or for both Google Chat and Facebook Messenger to experience technical issues concurrently. “But more and more platforms are centralizing and merging, leaving us increasingly vulnerable to very large-scale outages with huge impact,” she says. “What we saw yesterday were cascading failures because all of Facebook’s services (down to the access control on their doors) relied on one, centralized system.”
Sherry notes that the engineering community has also long held that combined, centralized systems are not the most desirable design. “The thing that’s safest is to keep things separate so that when one system fails, it’s a small and local failure, and not an entire global outage,” she says. “And so this push towards ‘one organization handles it all’ makes us more vulnerable to having these big and catastrophic problems when they could’ve been small problems.”