AWS outage: when senior engineers leave, let’s not act surprised


The Amazon Web Services (AWS) outage on Monday doesn’t seem to have been caused by a cyberattack. But critics are already pointing to the fact that Amazon has laid off at least 27,000 employees, including – surprise – senior engineers, since 2022.

America was still asleep early Monday when Europe, Africa, and Asia noticed something weird happening. Hundreds of apps and websites used by millions every day suddenly went offline.

Signal, Snapchat, Fortnite, Starbucks, Reddit, Coinbase, Ring, Amazon, Amazon Alexa, Apple TV, Apple Music, and hundreds of other apps and websites were down for tens of thousands of users for at least four hours.

AWS has since reported that all services affected by the outage have been restored. Still, the cloud provider is warning the estimated 1,000 organizations caught up in Monday’s outage to expect delays, network latency, and higher-than-average error rates.

All is calm, right? Well, two “why” questions now matter, and AWS has already answered one of them.

When it quacks like a DNS failure

The firm blamed network connectivity issues at AWS's US-EAST-1 data center in Northern Virginia, specifically citing Domain Name System (DNS) resolution failures.

For regular folks out there, DNS takes a “human-readable” domain name and translates it into a “machine-readable” IP address so that computers can communicate. If this translation fails, a computer can’t locate the server hosting the website, and the user can’t connect to that address.
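For the more technically inclined, that lookup can be sketched in a few lines of Python. This is only a minimal illustration, assuming the operating system's standard resolver as exposed by `socket.getaddrinfo`; the `resolve` helper and its `resolver` parameter are hypothetical names made up for this example and say nothing about how AWS's own systems work.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Translate a human-readable name into IP addresses.

    Returns a sorted list of unique addresses, or an empty list when
    the name cannot be resolved (the failure mode described above).
    `resolver` is a hypothetical hook so the lookup can be swapped out.
    """
    try:
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] holds the IP address as a string.
        results = resolver(hostname, None)
    except socket.gaierror:
        # The name could not be translated, so there is no address
        # to connect to, even if the server itself is healthy.
        return []
    return sorted({info[4][0] for info in results})
```

On a machine with working DNS, `resolve("example.com")` would return one or more addresses. When resolution fails, it returns an empty list, and a client is left with nowhere to send its request, which is roughly what apps relying on the affected endpoint ran into.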

This, of course, raises another question: Why was it allowed to happen? After all, it wasn’t the first time AWS had faced a DNS failure – the service also went down in 2023 and in 2021, when customers found that they couldn’t access airline reservations and payment apps during the five-hour outage.

In the world of tech, there’s even a haiku, proclaiming “It’s not DNS, there is no way it’s DNS, it was DNS.”

There’s no definite answer, to be fair. That’s because it’s complicated, and sometimes, things just happen. But surely we can see the large elephant in the room – the infamous Amazon layoffs, ongoing since 2022.

The e-commerce giant has already let more than 27,000 employees go, and this summer, industry insider Amanda Goodall said that AWS was about to reduce its workforce by 10% by the end of 2025, with around 25% of those positions being Principal-level roles.

Amazon's principal engineers are typically seasoned professionals with deep technical expertise. They are responsible for making critical architectural decisions and leading complex projects.

A reduction of this magnitude in such a crucial segment of the workforce could have significant implications for AWS, experts have long said. And now, after Monday’s AWS outage, it’s obvious that they’re absolutely correct.

“Where have the senior AWS engineers who’ve been to this dance before gone? And the answer increasingly is that they've left the building – taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them,” Corey Quinn, the chief cloud economist at The Duckbill Group, wrote on The Register post-outage.

Tribal knowledge leaving AWS

On Monday, Quinn saw clear communication issues. Immediately after the first reports of platforms going offline, AWS said it was investigating in the US-EAST-1 Region.

Seventy-five minutes later, the company confirmed “significant error rates for requests made to the DynamoDB endpoint” in that area. After 40 more minutes, engineers finally identified a DNS failure as the root cause of the event.

To Quinn, that’s not good enough, because in 2021 AWS itself called out slow outage notification times as an area for improvement. In fact, the firm did the same in 2020.

However, the expert’s main point is that AWS has seriously bled talent and is now paying the price.

At the end of 2023, Justin Garrison, a senior engineer, left AWS and publicly roasted it in a blog post, saying that the firm was already seeing an increase in Large Scale Events and predicting major outages in the near future.

He also said people were leaving AWS in droves: “In my small sphere of people, there wasn’t a single person under an L7 (under the Principal-level) that didn’t want out.”

Internal documents also say that Amazon (AWS is, of course, the cloud computing subsidiary of the parent giant) suffers from 69-81% regretted attrition across all employment levels.

In other words, many of the employees leaving are people the company would have preferred to keep. And that’s data from 2022: now, it’s almost undoubtedly worse.

It’s a safe bet all these departures have contributed to the outage, albeit indirectly. As Quinn puts it, it’s the tribal knowledge that’s missing now.

“You can hire a bunch of very smart people who will explain how DNS works at a deep technical level, but the one thing you can't hire for is the person who remembers that when DNS starts getting wonky, check that seemingly unrelated system in the corner, because it has historically played a contributing role to some outages of yesteryear,” said Quinn.

