Chaos engineering in an age of uncertainty
Many businesses are embracing chaos engineering as a proactive approach to identifying problems.
Nine months ago, our timelines were filled with unusually positive articles about preparing for a life of opportunities as we stepped into a new decade. Futurists dared to gaze into their virtual crystal balls to share their technology predictions for the roaring twenties. But nobody (other than Bill Gates) predicted a global pandemic and the chaos that would ensue.
Being proactive rather than reactive makes it possible to fight fires before they happen and to navigate turbulent conditions successfully. Chaos engineering (CE) requires a change of mindset: finding failures before they become outages.
The days of shutting down racks or pulling a network cable, and unwittingly causing an outage somewhere, can finally be retired.
But unexpected problems, vulnerabilities, and weaknesses are not always the result of human error. The reality is that every system will be affected by its environment and the random, turbulent conditions that come with it.
What is chaos engineering?
Every business is now challenged with ensuring its online services operate in every time zone, without downtime, so that users can continue consuming vast quantities of data. The term chaos engineering was made famous by Netflix when it migrated its services from a traditional data center to the cloud.
The streaming giant was forced to deal with the complexities and reliability of its new servers.
Chaos engineering is the practice of injecting failure in a controlled manner rather than waiting to react to issues.
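To make "injecting failure in a controlled manner" concrete, here is a minimal sketch. It assumes a hypothetical dependency call (fetch_recommendations) and a hypothetical fallback list; neither comes from Netflix or any real service. A wrapper fails a configurable fraction of calls on purpose, and the caller is expected to degrade gracefully instead of crashing.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical dependency; stands in for a real network call.
    return [f"title-{user_id}-{i}" for i in range(3)]


def recommendations_with_fallback(fetch, user_id):
    # The behavior under test: serve a generic list if the dependency fails.
    try:
        return fetch(user_id)
    except InjectedFault:
        return ["popular-1", "popular-2", "popular-3"]


# A seeded generator keeps the experiment repeatable.
rng = random.Random(42)
flaky_fetch = with_fault_injection(fetch_recommendations, failure_rate=0.3, rng=rng)

results = [recommendations_with_fallback(flaky_fetch, uid) for uid in range(100)]
fallbacks = sum(1 for r in results if r[0].startswith("popular"))
print(f"{fallbacks} of {len(results)} requests were served from the fallback")
```

The point of the experiment is the assertion you would make afterwards: every request still received an answer, even though a share of dependency calls failed.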
Resilience by design quickly became the latest best practice as the focus shifted to building better apps and websites that were prepared for the inevitable unplanned interruptions on the horizon.
Ultimately, if a customer raises a support ticket, you have already failed. Planned experiments that reveal weaknesses in systems, teams, and processes should help you prevent outages or fix them before users are aware.
What are the business benefits of chaos engineering?
If my Netflix stopped working regularly or became unreliable, I would probably switch to one of the many other reliable streaming services at my disposal. If Ticketmaster experienced an outage during a Billie Eilish ticket release, the promoter would pass those tickets to another ticket agency. When a website or app goes offline at a critical time, your customers will immediately go elsewhere.
The reliability and resiliency of tech are not just about profit margins. It's also increasingly becoming a matter of life and death.
For example, would you sit inside a self-driving car and put your life in the hands of poorly written code?
Mitigating risk, expecting the unexpected, and giving customers confidence that they are in safe hands and that the system will continue to operate safely when things go wrong are now table stakes.
Every business will have its own Achilles heel and its own reasons why downtime in a digital world costs money. An outage could bring down the virtual shutters on your business's front door when you least expect it. But it doesn't have to be like that. By embracing the chaos, you can learn what might fail and tweak the design of your system or infrastructure accordingly, so that you do not suffer an unplanned outage.
How does it work?
Netflix created Chaos Monkey, which randomly terminates instances in production. The brave move forces engineers to up their game and implement services that are resilient to instance failures. The old way of doing things involved engineers being thrown in at the deep end during an outage, when it could have been months or years since they had last encountered a problem like it.
By introducing frequent failures, CE incentivizes them to build resilient services and increases their familiarity with the infrastructure. A series of regular digital fire drills could expose vulnerabilities and result in reliable and responsive systems throughout your company.
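The fire-drill idea can be illustrated with a small in-memory simulation. This is not Chaos Monkey's code or API. The Instance and ServicePool classes, the min_healthy threshold, and the heal step are all assumptions made for the sketch. Each round terminates one randomly chosen instance and records whether the service would still have had enough healthy capacity.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    running: bool = True


@dataclass
class ServicePool:
    instances: list
    min_healthy: int  # assumed capacity needed to keep serving traffic

    def healthy(self):
        return [inst for inst in self.instances if inst.running]

    def is_available(self):
        return len(self.healthy()) >= self.min_healthy

    def heal(self):
        # Stand-in for an autoscaler replacing terminated instances.
        for inst in self.instances:
            inst.running = True


def terminate_random_instance(pool, rng):
    """Stop one running instance chosen at random; return it, or None."""
    candidates = pool.healthy()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def run_fire_drill(pool, rounds, rng):
    """Return a list of (terminated_name, service_still_available) per round."""
    report = []
    for _ in range(rounds):
        victim = terminate_random_instance(pool, rng)
        report.append((victim.name if victim else None, pool.is_available()))
        pool.heal()
    return report


pool = ServicePool([Instance("web-1"), Instance("web-2"), Instance("web-3")], min_healthy=2)
for name, available in run_fire_drill(pool, rounds=5, rng=random.Random(7)):
    print(f"terminated {name}: service available = {available}")
```

A pool with spare capacity survives every round. A pool sized exactly to min_healthy fails each one, which is the kind of weakness a drill is meant to expose before a real outage does.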
When faced with unprecedented demand on their streaming services, Netflix, YouTube, Amazon, and Disney announced that they would be downgrading their video quality to lower their overall bandwidth utilization. This more proactive approach prevented the downtime and bad publicity that would otherwise have followed.
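The arithmetic behind that decision is simple enough to sketch. The quality tiers and per-stream bitrates below are illustrative assumptions, not figures published by any of these companies. The function picks the highest tier whose total bandwidth fits within the available capacity.

```python
# Illustrative tiers, best first: (name, assumed megabits per second per stream).
QUALITY_LADDER = [("4k", 16.0), ("hd", 5.0), ("sd", 1.5)]


def pick_quality(active_streams, capacity_mbps):
    """Return the best tier whose total demand fits within capacity.

    If even the lowest tier does not fit, the lowest tier is returned
    anyway; a real system would also need admission control at that point.
    """
    for name, mbps_per_stream in QUALITY_LADDER:
        if active_streams * mbps_per_stream <= capacity_mbps:
            return name
    return QUALITY_LADDER[-1][0]


for streams in (10, 100, 500):
    print(streams, "streams ->", pick_quality(streams, capacity_mbps=1000))
```

Degrading quality for everyone is a deliberate trade: a slightly worse picture in exchange for a service that stays up.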
Embracing the chaos
The concept of running experiments in a live production environment will be enough to bring out a cold sweat in any IT director or CTO. Diving in headfirst would be reckless, which is why your first steps in chaos engineering should be taken in a separate environment that is as close to production as possible.
Technical teams are in control of the so-called "blast radius," the portion of the system an experiment is allowed to affect. Simulations should be carefully planned to unlock learning opportunities. As resilience becomes an established practice, you will feel much more confident about making changes.
Software, apps, and systems will be continuously tweaked and updated to add functionality or fix problems. But what do you break in the process? For this reason alone, it would be foolish to assume that a system will respond to a fault injection test (FIT) in the same manner several weeks from now.
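One way to picture control over the blast radius is a staged experiment that touches a small share of hosts first and widens only while the observed error rate stays acceptable. Everything here is hypothetical: the host names, the stage percentages, the error threshold, and the error_rate_for callback, which stands in for whatever monitoring you actually use.

```python
import random


def select_targets(hosts, blast_radius_pct, rng):
    """Pick a random subset of hosts covering at most the given percentage.

    At least one host is always selected when any hosts exist, so very
    small percentages still produce a meaningful experiment.
    """
    if not 0 < blast_radius_pct <= 100:
        raise ValueError("blast_radius_pct must be in (0, 100]")
    if not hosts:
        return []
    count = max(1, int(len(hosts) * blast_radius_pct / 100))
    return rng.sample(hosts, count)


def staged_experiment(hosts, stages, error_rate_for, max_error_rate, rng):
    """Widen the blast radius stage by stage, halting on a bad reading.

    Returns (completed_stages, halted_at), where halted_at is None if
    every stage stayed under the error threshold.
    """
    completed = []
    for pct in stages:
        targets = select_targets(hosts, pct, rng)
        if error_rate_for(targets) > max_error_rate:
            return completed, pct
        completed.append(pct)
    return completed, None


hosts = [f"host-{i}" for i in range(20)]


def fake_error_rate(targets):
    # Hypothetical observation: errors grow with the number of hosts hit.
    return len(targets) / 100


completed, halted_at = staged_experiment(
    hosts, stages=[5, 25, 50], error_rate_for=fake_error_rate,
    max_error_rate=0.06, rng=random.Random(1),
)
print("completed stages:", completed, "halted at:", halted_at)
```

Starting small and halting automatically is what separates a planned experiment from a self-inflicted outage.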
Gremlin's new 'Status Checks' capability offers peace of mind by automatically verifying that systems are healthy and ready for chaos engineering experiments. With Black Friday sales on the horizon, many retailers will be looking to make up for lost ground. But how many have embraced the chaos and learned the lessons of last year's eCommerce site outages?
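The general idea of verifying health before an experiment can be sketched without reference to any particular product. The code below is not Gremlin's API. The check names and the experiment function are hypothetical, and each check is just a callable that returns True when its system looks healthy.

```python
def preflight(checks):
    """Run every named health check and return the names that failed."""
    return [name for name, check in checks.items() if not check()]


def run_experiment_if_healthy(checks, experiment):
    """Run the experiment only when all pre-flight checks pass."""
    failures = preflight(checks)
    if failures:
        return {"ran": False, "blocked_by": failures}
    experiment()
    return {"ran": True, "blocked_by": []}


# Hypothetical checks; real ones would query monitoring or alerting systems.
checks = {
    "no_open_incidents": lambda: True,
    "error_rate_normal": lambda: True,
    "on_call_engineer_available": lambda: False,
}

events = []
outcome = run_experiment_if_healthy(checks, lambda: events.append("latency injected"))
print(outcome)
```

Refusing to start when the system is already unhealthy keeps an experiment from compounding a problem that is in progress.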
Chaos engineering should never be seen as the cause of your problems.
It should be seen as a way of revealing them before they result in a costly outage. Behind the unnerving name, CE is a way of increasing resilience, reducing risk, and learning valuable lessons about your organization. But the biggest winners should be your customers, who will enjoy an improved user experience and remain loyal to your brand.