Cloudflare outage destroys cloud resilience myth, but what other choice do we have?

Recent failures in cloud compute, global edge networks, and endpoint security have exposed how complex digital systems really behave under stress. Experts say the issue isn’t that hyperscalers are unreliable, but that their resilience has limits we rarely acknowledge.
When Cloudflare stumbled last week, large parts of the internet wobbled with it. LinkedIn, X, and ChatGPT were inaccessible for several hours. Although Cloudflare isn’t used by every website, it has become a major component of the internet’s infrastructure, especially for popular, high-traffic, and security-sensitive websites.
According to W3Techs, Cloudflare has an 81.5% market share among websites that use a reverse proxy (roughly 8 out of every 10 such sites). You can think of a reverse proxy as a concierge that directs visitors, ensuring both speed and security.
The same data shows that Cloudflare has held a dominant position among reverse proxy providers for years. Of the websites that use Cloudflare as a reverse proxy, 99.1% also use Cloudflare’s own server infrastructure.
So when Cloudflare has a bad day, the internet feels it.
Just weeks earlier, AWS suffered an outage that took out major social media platforms, banks, and government portals.
And last year, a defective CrowdStrike update sent Windows systems into a reboot loop, grounding flights and disrupting financial services worldwide.
For an industry that promises near-perfect uptime, the last few months have raised an uncomfortable question: how resilient is the cloud, really?
It’s a fair question. Cloud providers often present their global distribution and redundancy as an almost infallible shield that can keep the internet running through anything. But these outages paint a different reality.
Experts we spoke to suggest resilience is no longer a fixed property of the cloud, but a moving target. This is thanks to a deeper architectural turn that they believe strays far from the internet’s founding principles.
“The internet was designed as a decentralized system with extreme reliability coming from that, but in modern deployments, we have an increasing concentration of resources and the loss of reliability which comes from that,” says Peter Zaitsev, founder of Percona, an open source database software, support, and services company.
The real nature of cloud resilience
Zaitsev notes that hyperscale systems rely on enormous distributed architectures that are “extremely complicated” and “end up having those ‘one in a million’ edge cases” which only emerge when run at production scale.
In other words, resilience isn’t being eroded, but is simply being outpaced by the very scale of the systems it’s supposed to protect.
Andrew Jenkinson, CEO of Cybersec Innovation Partners, argues that structural weaknesses often lie deep in the foundational layers, such as DNS and identity-related services.
He adds that these outages underscore a structural truth: that even the most sophisticated providers remain vulnerable in the foundational layers of their infrastructure. The systems are “deeply interconnected,” and weaknesses in one component cascade into widespread disruption.
“Distribution does not always equal resilience,” he says, pointing to incomplete DNSSEC adoption, exposed or lightly governed assets, and reliance on sprawling legacy configurations as some of the factors that introduce fragility.
He reframes the problem as a reality gap, not an engineering failure: hyperscalers promote redundancy, yet many critical layers remain brittle.
Together, the experts believe that cloud resilience is a genuine ability, but one that depends on specific circumstances and conditions.
The invisible complexity
Zaitsev is quick to clarify that hyperscalers remain extraordinarily reliable.
“Few organizations running their own data centers… would be able to achieve a similar level of uptime,” he notes.
But reliability doesn’t mean invincibility. And when centralized, tightly coupled systems fail, they fail big.
Zaitsev warns that we haven’t yet witnessed a truly catastrophic event, one that could expose or even wipe a cloud provider’s customer data, which would be far more damaging than temporary downtime.
He also explains that as cloud services centralize control within small internal teams, they introduce new risks. A single mistake, misconfiguration, or flawed internal process can trigger large-scale failures.
Jenkinson, meanwhile, doesn’t accuse hyperscalers of overconfidence. Instead, he argues it is the industry that underestimates the complexity required to deliver true end-to-end resilience.
That complexity is invisible to customers, who often assume redundancy protects them from any failure. But redundancy only covers the failure modes it was designed for, and not the ones that emerge from the unpredictable interactions of distributed systems.
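A rough back-of-the-envelope sketch in Python can make that point concrete. The availability figures below are illustrative assumptions, not numbers from Cloudflare, AWS, or the experts quoted here, and the function names are hypothetical. The sketch also assumes replicas fail independently of one another, which real systems do not always honor.

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of N replicas when any single healthy replica is enough.

    Assumes replica failures are independent (an idealization).
    """
    return 1.0 - (1.0 - replica_availability) ** replicas


def with_shared_dependency(redundant_availability: float,
                           shared_availability: float) -> float:
    """Series composition: the redundant tier is only reachable while the
    shared layer (for example DNS or a control plane) is also up."""
    return redundant_availability * shared_availability


# Illustrative assumptions: three replicas at 99.9% each,
# behind one shared layer at 99.95%.
replicas_only = parallel_availability(0.999, 3)
end_to_end = with_shared_dependency(replicas_only, 0.9995)

print(f"redundant tier alone:  {replicas_only:.7%}")
print(f"with shared dependency: {end_to_end:.4%}")
```

Under these assumptions, adding replicas pushes the redundant tier toward near-perfect availability, yet the end-to-end figure can never exceed that of the single shared layer every request depends on.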
Preparing for the future
If resilience isn’t automatic, how should organizations prepare?
Jenkinson stresses the need for transparency. Most companies don’t know their cloud vendor’s DNS security posture, dependency chain, or configuration hygiene. Stronger accountability, such as mandatory reporting on DNSSEC usage and internal controls, he says, would give companies a more realistic view of their exposure.
He also warns that widespread outsourcing has hollowed out internal skills in DNS, routing, and distributed-system architecture. Without rebuilding that expertise, he argues, organizations will keep creating more failure points, even when they try to spread workloads across multiple clouds.
In the same vein, Zaitsev notes that while a multi-cloud approach makes sense, it introduces significant additional complexity. He suggests that a disaster recovery plan under which workloads and data backups can be restored on another cloud is a better approach than running everything across multiple clouds.
“The best way to achieve this is to use basic cloud services and rely on open source software for the rest, as it allows you to deploy your environments anywhere easily,” he recommends.
Jenkinson agrees, saying multi-cloud and multi-DNS architectures can improve resilience, but only when matched with the necessary operational expertise.
“Until organizations rebuild internal skill sets and demand higher security baselines from vendors, the cycle of outages, exposure, and opportunistic cyber-activity will continue,” stresses Jenkinson.
These outages don’t show that the cloud is unreliable. They expose a lack of understanding about what resilience actually looks like at hyperscale.
The cloud remains staggeringly reliable, but it is also increasingly complex, increasingly centralized, and increasingly opaque. If you believe the experts, resilience today depends on acknowledging those realities, not assuming the cloud’s architecture guarantees safety.
As Zaitsev reminds us, the internet’s original design accepted that anything could fail, but nothing should bring down everything else with it. Relearning that mindset may be the most important step toward building a more resilient digital world.