Netflix Chaos Engineering: Address the uncertainty of distributed systems
Do you live your distributed systems life by the Principles of Chaos? Well, it’s time to start. Chaos Engineering is a new way to address the uncertainty of distributed systems at scale and to build confidence in the behaviour of the system.
The brains behind Netflix software development have long been lauded for their contributions to open source software, and can now add software methodologies to the list. The web has recently been introduced to Chaos Engineering: a technique for uncovering systemic weaknesses in distributed systems through a series of experiments.
The aim is to build confidence in complex systems running in production. The interactions between a system’s individual services are the target of what Netflix calls an “empirical, systems-based approach”. The weaknesses that call for a proactive approach are defined as improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; and cascading failures when a single point of failure crashes.
Chaos in Practice
The experiments that uncover systemic weaknesses follow four steps:
- Define a ‘steady state’ of your system that indicates normal behaviour.
- Hypothesise that this steady state will continue in both a control group and an experimental group.
- Start introducing disruptions such as servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Look for differences in steady state between the control and experimental groups to conclude what measures need to be taken.
The philosophy argues that “the harder it is to disrupt the steady state, the more confidence we have in the behaviour of the system”. When an experiment that follows these steps uncovers a weakness, developers have a target for improvement before the undesired behaviour manifests in the system at large.
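The four steps can be illustrated with a toy simulation. This is only a minimal sketch and not Netflix tooling: the `Group` class, the `serve` stand-in for a service, the success-rate metric and the tolerance value are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Group:
    """A named set of simulated request outcomes (True = request succeeded)."""
    name: str
    outcomes: List[bool]

    def success_rate(self) -> float:
        # Step 1: the steady state is defined here as the success rate.
        return sum(self.outcomes) / len(self.outcomes)


def serve(requests: int, fail_every: int = 0) -> List[bool]:
    """Hypothetical stand-in for a service.

    Every `fail_every`-th request fails; 0 means no injected failures.
    """
    return [not (fail_every and i % fail_every == 0)
            for i in range(1, requests + 1)]


def steady_state_held(control: Group, experiment: Group, tolerance: float) -> bool:
    """Step 4: compare the groups; True if they differ by at most `tolerance`."""
    return abs(control.success_rate() - experiment.success_rate()) <= tolerance


# Step 2: a control group and an experimental group.
control = Group("control", serve(1000))
# Step 3: inject a disruption into the experimental group only
# (here, an assumed fault that makes every 10th request fail).
experiment = Group("experiment", serve(1000, fail_every=10))

if steady_state_held(control, experiment, tolerance=0.02):
    print("Steady state held: confidence in the system increases.")
else:
    print("Steady state disrupted: a weakness to fix has been found.")
```

In a real experiment the outcomes would come from live metrics rather than a simulated list, and the tolerance would be chosen from the normal variation seen in the steady state.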
The living document also lists advanced principles describing the ideal application of Chaos Engineering:
- Build a Hypothesis around Steady State Behaviour
Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behaviour. By focusing on systemic behaviour patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
- Vary Real-world Events
Chaos variables reflect real-world events. Prioritise events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
- Run Experiments in Production
Systems behave differently depending on environment and traffic patterns. Since the behaviour of utilisation can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic.
- Automate Experiments to Run Continuously
Running experiments manually is labour-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
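Two of these principles lend themselves to a short sketch: expressing a steady-state hypothesis in terms of measurable output, and ranking candidate events. Everything named here is an assumption for illustration. The thresholds, the nearest-rank percentile method, the `Event` scores and the choice to rank by impact multiplied by frequency (the principles say only to prioritise by one or the other) are not taken from the principles themselves.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of `values`, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


@dataclass
class SteadyState:
    """A hypothesis about measurable output, not about internals."""
    max_error_rate: float       # assumed threshold, e.g. 0.01 = 1%
    max_p99_latency_ms: float   # assumed threshold in milliseconds

    def holds(self, latencies_ms: Sequence[float], errors: int) -> bool:
        """Check one short measurement window against the hypothesis."""
        error_rate = errors / len(latencies_ms)
        p99 = percentile(latencies_ms, 99)
        return error_rate <= self.max_error_rate and p99 <= self.max_p99_latency_ms


@dataclass
class Event:
    """A candidate Chaos variable with hypothetical scores."""
    name: str
    impact: float      # assumed 0..1 estimate of potential impact
    frequency: float   # assumed estimate of occurrences per month


def prioritise(events: List[Event]) -> List[Event]:
    """Rank events by impact x frequency (one possible weighting)."""
    return sorted(events, key=lambda e: e.impact * e.frequency, reverse=True)


hypothesis = SteadyState(max_error_rate=0.01, max_p99_latency_ms=100)
window = list(range(1, 101))  # 100 simulated request latencies, 1..100 ms
print(hypothesis.holds(window, errors=1))

candidates = [
    Event("region outage", impact=1.0, frequency=0.1),
    Event("server dies", impact=0.3, frequency=10),
    Event("traffic spike", impact=0.5, frequency=4),
]
for event in prioritise(candidates):
    print(event.name)
```

An automated setup would run a check like `holds` on a schedule against fresh production metrics, which is the kind of continuous orchestration and analysis the last principle describes.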
This new methodology fits well with Netflix’s already sizeable arsenal of tools for monitoring and testing their cloud operation. Most are part of their Simian Army suite, which includes Chaos Monkey, a tool that helps applications tolerate random instance failures.