A method to the madness

The benefits of chaos engineering-as-a-service

Matt Fornaciari

There’s no better way to test something than to break it. In this article, Matthew Fornaciari, co-founder and CTO of Gremlin, discusses chaos engineering and how it can make your systems more resilient. Let loose the gremlin; it’s time to get testing.

For nearly a decade, I’ve carried a pager, first at Amazon and then at Salesforce. I was responsible for managing and resolving high-pressure incidents and ensuring our services stayed up and running for customers, no matter what. In other words, I’ve had skin in the game, and I understand the pager pain caused by unforeseen outages and downtime better than most.

Naturally, I became incredibly interested in ways to increase resilience in complex systems, which inevitably guided me to a practice known as Chaos Engineering. Some of you will already be familiar with the concept, but for those who aren’t, the core idea is that by proactively injecting failures into your systems, you learn about weaknesses before they have the chance to impact customers. Think of it as a flu shot for your applications and infrastructure. You inject a bit of the bad, in a controlled manner, in order to help the system develop a tolerance.
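To make the flu-shot idea concrete, here is a minimal sketch in Python. It is not how Gremlin or Chaos Monkey works; every name in it (`with_fault_injection`, `fetch_recommendations`, `render_homepage`, `run_experiment`) is hypothetical and exists only for illustration. A wrapper makes a chosen fraction of calls to a dependency fail on purpose, and the experiment counts how many pages were still served in degraded form.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical downstream dependency."""
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


def render_homepage(user_id, fetch):
    """Code under test: it should degrade gracefully when the dependency fails."""
    try:
        items = fetch(user_id)
    except Exception:
        items = []  # fallback: serve the page without recommendations
    return {"user": user_id, "recommendations": items}


def run_experiment(failure_rate, calls=1000, seed=42):
    """Serve `calls` pages with faults injected and report what happened."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_recommendations, failure_rate, rng)
    pages = [render_homepage(i, flaky_fetch) for i in range(calls)]
    degraded = sum(1 for page in pages if not page["recommendations"])
    return {"served": len(pages), "degraded": degraded}


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.2))
```

If `render_homepage` had no fallback, the same experiment would surface the weakness as an unhandled exception, which is exactly the kind of finding you want before customers hit it.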

Before we started building Gremlin, the only publicly available tools for companies to perform Chaos Engineering were open-source projects such as Chaos Monkey. And while these tools have done a great job of exposing more people to Chaos Engineering, they have some severe limitations: their capabilities and scope are narrowly tailored, they lack safety and security features entirely, and they are neither easy to set up nor easy to maintain.


So while the roots of Chaos Engineering have been around for over a decade, dating back to Jesse Robbins running through datacenters and unplugging cables, the industry needs more, especially as systems become more complex. Chaos Engineering-as-a-Service gives engineers what they need to perform experiments safely and securely. The end goal is to eventually automate those experiments as a form of resilience regression testing, preventing the dreaded drift into failure.

Why not just build a solution in-house?

If you have the time, money, and expertise to build and maintain a solution tailored for your own environment, by all means, give it a shot. But in most cases, in a world where we are constantly paying for services, why spend the energy and engineering resources on something that’s not your core competency? Chaos Engineering can also be dangerous, so you need to hire experts to maintain the system and run the experiments safely.

I’ve heard countless stories of companies building their own chaos engineering tools, often inspired by Netflix’s Chaos Monkey, and then freaking out and shutting the tool down because it triggered a customer-facing failure. But that runs counter to the spirit of Chaos Engineering! Finding the failure is the whole point of running chaos experiments, and the next step (besides fixing the failure) is automating that experiment to ensure the problem has truly been fixed. Having a service that gives engineers more control, and lets them run experiments safely with an “undo” button, empowers companies to extract much greater ROI from their Chaos Engineering practices.
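One way to picture the “undo” button is an experiment that always restores the original state and halts itself when a health metric crosses a limit. The Python sketch below is an assumption-laden illustration, not Gremlin’s implementation: `chaos_attack`, `run_with_guardrail`, the `latency_ms` setting and the error-rate readings are all hypothetical.

```python
from contextlib import contextmanager


class AbortExperiment(Exception):
    """Raised when the guardrail decides the experiment must stop."""


@contextmanager
def chaos_attack(state, key, faulty_value):
    """Apply a fault to `state` and always restore the original value on exit."""
    original = state[key]
    state[key] = faulty_value
    try:
        yield state
    finally:
        state[key] = original  # the "undo" step; runs even on abort or error


def run_with_guardrail(state, error_rates, max_error_rate):
    """Inject latency, watch an error-rate metric, and halt if it gets too high.

    `error_rates` stands in for readings from a monitoring system.
    """
    observed = []
    try:
        with chaos_attack(state, "latency_ms", 500):
            for rate in error_rates:
                observed.append(rate)
                if rate > max_error_rate:
                    raise AbortExperiment(f"error rate {rate} exceeded limit")
    except AbortExperiment:
        return {"aborted": True, "observed": observed}
    return {"aborted": False, "observed": observed}


if __name__ == "__main__":
    service = {"latency_ms": 20}
    print(run_with_guardrail(service, [0.01, 0.02, 0.20, 0.03], max_error_rate=0.05))
    print(service)  # latency is back to its original value
```

Because the rollback lives in a `finally` block, the fault is removed whether the experiment finishes, is aborted by the guardrail, or crashes. Once the underlying weakness is fixed, the same experiment can be rerun automatically to confirm the fix holds.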


At the end of the day, most engineering teams I’ve spoken with want a simple way to get up and running with Chaos Engineering. Building an internal tool, even if it’s forked from an existing open-source project, will require a non-trivial amount of time to build robust features (not to mention sufficiently hardening the tool for security). Chaos Engineering-as-a-Service means you will get an intuitive UI, customer support, out-of-the-box integrations and everything else you need to get experimenting in a matter of minutes.


Matt Fornaciari

Matt Fornaciari is co-founder and CTO of Gremlin Inc. He joined from Salesforce, where he was a Senior Platform Engineer. Before that, he improved the reliability and customer experience of the Amazon retail website. He founded the “Fatals” team to analyze and fix customer-facing failures, reducing the number of retail website errors by half in his first year.
