
A trio of autumn outages in a four-week period highlights how configuration and metadata errors in the cloud are becoming “the new power cuts.”
The outages at AWS, Microsoft Azure, and, most recently, Cloudflare resulted from different technical triggers. However, Cybernews and expert analysis reveal that all three stem from failures inside the providers’ own core infrastructure.
In each case, the incidents were not capacity failures resulting from a spike in seasonal traffic, nor were they denial-of-service attacks designed to overwhelm networks. They were subtle internal software and configuration issues in the systems that run the cloud itself.
“The recent outages are not so much about the cloud being broken, but that we’ve built an incredibly complex system and then concentrated a lot of it in a small number of hands,” Javvad Malik, lead CISO advisor at cybersecurity training platform KnowBe4, tells Cybernews.
Circular dependency
According to cloud expert Mayur Upadhyaya, CEO of APIContext, a software company providing API monitoring and governance, what we’re seeing is the result of three significant shifts converging at once: extreme automation, much heavier machine-driven traffic, and a very high concentration of infrastructure among a small number of providers.
“The internet has quietly become a circular dependency machine: cloud platforms depend on DNS, DNS and control planes run on those same clouds, identity and security depend on both, and CDNs (content delivery networks) sit across the whole surface,” he says.
“When something small goes wrong in one of those layers, the blast radius is now global by default. The timing may be coincidental, but the underlying forces are systemic,” he adds.
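The circular dependency Upadhyaya describes can be made concrete with a small graph check. The sketch below is illustrative only: the layer names and edges in `DEPENDS_ON` are assumptions loosely based on his quote, not a map of any real provider's architecture. It uses a depth-first search to report one loop if any exists.

```python
from typing import Dict, List, Optional

# Hypothetical dependency map; each key depends on the layers in its list.
DEPENDS_ON: Dict[str, List[str]] = {
    "cloud_platform": ["dns"],
    "dns": ["control_plane"],
    "control_plane": ["cloud_platform"],
    "identity": ["cloud_platform", "dns"],
    "cdn": ["dns", "identity"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    unvisited, in_progress, done = 0, 1, 2
    state: Dict[str, int] = {node: unvisited for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = in_progress
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, unvisited)
            if dep_state == in_progress:
                # We came back to a node still on the current path: a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == unvisited:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = done
        return None

    for node in list(graph):
        if state[node] == unvisited:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(DEPENDS_ON))
```

With the assumed map, the search finds the loop from the cloud platform through DNS and the control plane back to the cloud platform, which is the kind of structure that lets a small fault in one layer reach the others.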
Cloudflare outage explained
The Cloudflare outage was caused by a configuration file used for threat-traffic management, which grew much larger than expected and triggered a software crash in systems that handle core network traffic for many services.
“A configuration file that grew beyond an expected size of entries triggered a crash in the software system that handles traffic for a number of our services,” the company stated.
Given Cloudflare’s role as a global network, numerous high-profile sites were reported to have been affected, including X, ChatGPT, IKEA, and Canva.
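The failure mode described here, a generated file outgrowing what the consuming software expects, can be sketched in a few lines. This is not Cloudflare's code or its fix. It is a minimal illustration, assuming a hypothetical entry limit (`MAX_ENTRIES`) and a hypothetical loader that rejects an oversized file and keeps the last known good version instead of crashing.

```python
from typing import List

# Hypothetical limit; the article does not give the real threshold.
MAX_ENTRIES = 200


class ConfigTooLarge(Exception):
    """Raised when a new configuration exceeds the expected number of entries."""


def validate_entries(entries: List[str], limit: int = MAX_ENTRIES) -> List[str]:
    """Return the entries unchanged, or raise if there are more than `limit`."""
    if len(entries) > limit:
        raise ConfigTooLarge(f"{len(entries)} entries exceeds limit of {limit}")
    return entries


def load_config(
    new_entries: List[str],
    last_known_good: List[str],
    limit: int = MAX_ENTRIES,
) -> List[str]:
    """Accept the new config if it is within the limit, else keep the old one."""
    try:
        return validate_entries(new_entries, limit)
    except ConfigTooLarge as err:
        # Fail safe: log the rejection and keep serving with the previous config.
        print(f"rejected new config: {err}; keeping previous version")
        return last_known_good


if __name__ == "__main__":
    current = ["rule-1", "rule-2"]
    oversized = [f"rule-{i}" for i in range(500)]
    current = load_config(oversized, current)
    print(len(current))
```

The design choice is that an oversized input is treated as a rejected update rather than a fatal error, so the traffic-handling process keeps running on its previous configuration.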
AWS down
Tuesday’s outage followed two others at the cloud giants. The first, at AWS on October 20th, centered on its US-EAST-1 region (Virginia) and affected a wide range of services and companies globally.
The root cause was identified as an internal Domain Name System (DNS) fault [DNS is like the internet’s phone book], which cascaded beyond a single region, knocking multiple third-party services offline, including Signal, Snapchat, Fortnite, Starbucks, Reddit, Coinbase, Ring, Amazon, Amazon Alexa, Apple TV, and Apple Music.
Microsoft Azure downtime
Less than two weeks later, on October 29th, Microsoft’s cloud platform Azure experienced global downtime triggered by “an inadvertent configuration change” in its CDN service.
The outage affected Microsoft’s own services, such as 365 Copilot, as well as Azure customers, including Minecraft, several airline check-in desks, payment systems, and numerous other Azure-hosted third-party services.
What do the Azure, AWS, and Cloudflare outages have in common?
In two out of three cases (Azure, Cloudflare), the root cause was a configuration/metadata issue rather than hardware or a DDoS attack.
According to KnowBe4’s Malik, configuration and metadata errors have become “the new power cuts.”
He warns: “One line in the wrong place can spread across regions.”
In all cases, the provider’s failure also rippled into many services since so many companies depend on the same infrastructure. According to Upadhyaya, this common thread is important, underlining how much risk we face from our own automation.
“All three were triggered by valid-looking changes inside highly automated systems, and they cascaded very quickly because those systems sit at the foundation of so many other services,” he says.
Is seasonal load a factor?
While all three outages occurred in autumn, pre-holiday seasonal traffic, new AI-driven traffic, and API requests, all of which increase stress on cloud and edge networks, have not been raised as a serious cause for concern.
However, Nigel Douglas, head of developer relations at Cloudsmith, notes that these continuous high-load environments are capable of acting as persistent stress tests.
“Crucially, this stress is often what triggers the escalation of a small, latent configuration fault into a global disruption. A bug that causes a system to crash only when it exceeds a certain load or configuration threshold becomes exposed and catastrophic under peak conditions.”
Perhaps the real story is not that these autumnal outages happened, but that resilience may no longer be keeping pace with scale. Cloud networks now resemble high-voltage grids: interconnected, optimized, automated, and vulnerable to unforeseen chain reactions.
Moving forward, resilience is likely to require multi-cloud strategies, offline failovers, and infrastructure that can degrade quietly rather than catastrophically.
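As a rough illustration of what multi-cloud failover and quiet degradation could look like in application code, the sketch below tries a list of providers in order and falls back to a cached copy if all of them fail. The function names (`fetch_with_fallback`, `primary_cloud`, `secondary_cloud`) are hypothetical stand-ins, not real provider APIs, and a production setup would also need health checks, timeouts, and data replication.

```python
from typing import Callable, Optional, Sequence


def fetch_with_fallback(
    providers: Sequence[Callable[[], str]],
    cached: Optional[str] = None,
) -> str:
    """Try each provider in order; fall back to a cached copy if all fail."""
    for provider in providers:
        try:
            return provider()
        except Exception as err:  # broad on purpose: any failure moves on
            name = getattr(provider, "__name__", repr(provider))
            print(f"{name} failed: {err}")
    if cached is not None:
        # Degrade quietly: serve stale data instead of an error.
        return cached
    raise RuntimeError("all providers failed and no cached copy is available")


# Hypothetical stand-ins for calls to two independent cloud providers.
def primary_cloud() -> str:
    raise ConnectionError("region unavailable")


def secondary_cloud() -> str:
    return "fresh response from secondary"


if __name__ == "__main__":
    print(fetch_with_fallback([primary_cloud, secondary_cloud]))
    print(fetch_with_fallback([primary_cloud], cached="stale but usable copy"))
```

The ordering matters: fresh data from any provider is preferred, stale data is the next best outcome, and an outright error is raised only when nothing else is available.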