Why Facebook went down, and what BGP routing is

Billions of people yesterday were forced to remember what life was like before Facebook Inc. Turns out that's because someone at Facebook sent a shoddy update.

As if to kick off Cybersecurity Awareness Month, Instagram, Messenger, WhatsApp, Oculus, and Facebook all went down for almost six hours on Monday.

The event was best illustrated by Twitter's official account tweeting 'hello, literally everyone,' as millions flocked to the social network to find out why the Facebook family of social media channels was down.

Competing theories, from a cyberattack to a flawed update, soon emerged, with network routing issues receiving the most attention as the likely cause of the outage. That theory was later largely confirmed by Facebook's VP of Infrastructure, Santosh Janardhan.

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt," Janardhan wrote in a blog post.

Fancy tech jargon, sure, but what does that all mean?

Valley, we have a problem

At first, pundits thought there was a problem with the Domain Name System (DNS) provider. The DNS is like a translator. Once we type URLs like facebook.com in a browser, the DNS service matches that domain with a specific IP address. Without that, the computer does not know which server hosts the website we're looking for.
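To make the translator idea concrete, here is a minimal sketch of a name lookup in Python. It is only a toy: real DNS involves resolvers, caches, and authoritative servers, and the `TOY_DNS_RECORDS` table, the `resolve` function, and the addresses (taken from the 192.0.2.0/24 range reserved for documentation) are all made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical records: these are documentation-range addresses,
# not the real IPs of any of these services.
TOY_DNS_RECORDS: Dict[str, str] = {
    "facebook.com": "192.0.2.10",
    "instagram.com": "192.0.2.20",
}


def resolve(domain: str, records: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the IP address stored for a domain, or None if there is no record."""
    table = TOY_DNS_RECORDS if records is None else records
    # Domain names are case-insensitive and may end with a trailing dot.
    normalized = domain.lower().rstrip(".")
    return table.get(normalized)


if __name__ == "__main__":
    print(resolve("facebook.com"))   # a known name maps to an address
    print(resolve("example.org"))    # an unknown name maps to nothing
```

When the lookup returns nothing, the browser has no address to connect to, which is roughly what users experienced during the outage.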

According to a blog post by Cloudflare, one of the largest DNS service providers, the company first checked whether its own services were working properly after noticing that Facebook was down. Since the problem affected Instagram, WhatsApp, Messenger, and other Facebook products, however, it soon became apparent that a different problem was affecting Facebook and its affiliates.

As Cloudflare put it, Facebook's DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had "pulled the cables" from their data centers all at once and disconnected them from the Internet.

What's BGP?

As all of this was unfolding, Cloudflare's VP of Emerging Technologies tweeted that Facebook's BGP routes were withdrawn from the Internet. To grossly oversimplify, BGP, or the Border Gateway Protocol, works like an old-timey railroad switchman, deciding which tracks data packets should use to travel.

The whole Internet is made up of a network of switchmen, informing each other of the best route a specific data packet should take to reach its final destination. Yesterday, the switchman letting others know where to find Facebook went AWOL.

That meant Facebook's servers, along with all the apps billions of people use, went missing because no one was telling the rest of the Internet how to get there. People were sending queries to Facebook, but they ended up nowhere.
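The switchman analogy can be sketched in a few lines of Python. This is a heavily simplified model, not how BGP is actually implemented: the `ToyRoutingTable` class is invented for illustration, the prefix comes from the 198.51.100.0/24 documentation range, and `AS64500` is a documentation-only network number rather than Facebook's real one. The sketch only shows what happens to a lookup once an announced route is withdrawn.

```python
import ipaddress
from typing import Dict, Optional


class ToyRoutingTable:
    """A toy stand-in for one router's view of announced routes."""

    def __init__(self) -> None:
        # Maps an announced address block (prefix) to the network that announced it.
        self._routes: Dict[ipaddress.IPv4Network, str] = {}

    def announce(self, prefix: str, origin: str) -> None:
        """Record that `origin` says it can deliver traffic for `prefix`."""
        self._routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix: str) -> None:
        """Forget a previously announced prefix."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the origin of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self._routes if ip in net]
        if not matches:
            return None  # no switchman knows the way
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


if __name__ == "__main__":
    table = ToyRoutingTable()
    table.announce("198.51.100.0/24", "AS64500")  # hypothetical announcement
    print(table.lookup("198.51.100.7"))           # route known
    table.withdraw("198.51.100.0/24")
    print(table.lookup("198.51.100.7"))           # route gone, packet has nowhere to go
```

In this model the servers behind the prefix never stop running. They simply become unreachable the moment the route disappears, which matches the picture of queries ending up nowhere.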

That very idea is reflected in the statement Facebook made after restoring activities. A likely misguided network configuration update removed Facebook's BGP routes, causing an outage and affecting billions worldwide.

Turning on and off

Two questions remain: why the problem took so long to fix and how the company eventually fixed it. First, it's essential to understand that Facebook runs its internal systems on its own servers.

Even though that might be useful, it becomes a major problem once the BGP routes are gone, since the internal systems become unreachable as well.

That's likely the reason why Facebook employees were unable to access conference rooms, and why the head of Instagram declared that Monday felt like a 'snow day,' meaning a complete halt in daily activities.

With many of the tools engineers would normally use to solve the problem not working, the likeliest fix was to reach Facebook's servers in California in person and reset them physically.

Stealing the thunder

In a weird coincidence for Facebook, the outage happened just one day after a major blow to the social network's reputation.

The company's former civic integrity manager, Frances Haugen, went public with the allegation that Facebook had known its product negatively affected public safety and chose to ignore it in favor of profit.

Facebook paused the launch of Instagram for kids last week after another leaked internal document showed the company was fully aware of how the Instagram app negatively affects girls' mental health.