Why we need Resilient Software Design
Michael Nygard’s Circuit Breaker Pattern has been adopted by Netflix and has become a central part of Resilient Software Design. Its acclaimed author explains the benefits of Resilient Software Design and why it matters exactly how we fail.
JAXenter: Why is Resilient Software Design so important that we need an extra term for it? Why not just say “failsafe” or “failure-resistant”?
Michael Nygard: When we talk about failsafe, we literally mean that a system fails in a safe way. Resilient systems, on the other hand, recover back to their original operational state. It’s a subtle difference, but an important one. When I was in operations, our main way to restore functionality was to restart things: first the application, then if that didn’t work, the host. Most of the time, we also had to restart anything that depended on the failed component.
In contrast, a resilient system would automatically cut off failing components and reintegrate them once they are no longer failing.
In the introduction of your book “Release It”, you write: “We will prepare for the armies of illogical users who do crazy, unpredictable things”. Is this what resilience is all about: embracing the chaos of the real world because you cannot control it, and translating this way of thinking into software architectures?
Absolutely. The lack of control is very obvious when we examine user behavior. We also lack control over many parts of our own infrastructure. Operating systems crash. Applications have bugs. Fans, CPUs, disk drives… everything breaks.
When we were dealing with small systems (say a few machines up to a few tens of machines) we could regard “100% operational” as the normal state and any deviation was an exceptional condition. However, as systems scale up, the simple laws of probability dictate that “100% operational” is actually very rare. The normal mode of operation is partial failure.
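The arithmetic behind that claim can be sketched in a few lines. This is only an illustration: it assumes machines fail independently, and the per-machine availability of 99.9% is an invented figure, not one from the interview.

```python
def probability_all_up(per_machine_availability: float, machine_count: int) -> float:
    """Chance that every machine is up at the same instant, assuming independent failures."""
    return per_machine_availability ** machine_count


# 0.999 is an assumed per-machine availability, chosen only for illustration.
for machine_count in (10, 100, 1000, 10000):
    chance = probability_all_up(0.999, machine_count)
    print(f"{machine_count:>6} machines: {chance:.1%} chance that all are up")
```

Under these assumptions, ten machines are all up about 99% of the time, while a thousand machines are all up only about 37% of the time.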
What businesses and industries is Resilient Design especially important for?
Resilient systems – at least of the kind I write about – accept that failures are normal and that it’s OK to run in a partially failing mode. This makes them unsuitable for life-critical or real-time applications.
They are suitable for an industry where demand is constant and uncontrolled and where individual transactions can be sacrificed without catastrophic losses. This describes most web systems, especially any in the commerce, media, or social sphere.
What are the most common mistakes made in systems that are not resilient?
By far the most common mistake is simply not thinking about production operations and failures in production. Many systems are built to pass QA testing rather than to survive the world after launch.
How can Microservices help make systems more resilient?
I think this is a topic that needs more attention when people consider moving to microservices. There are subtle issues to address.
I’ll say that microservices can help make a system more resilient, depending on how you decompose your system into services and how you build each service. For me, the question is not about how large the service is, but how large the “failure domains” are. One microservice architecture may in fact be a single large failure domain, whereas another may be split into a dozen isolated failure domains.
There are two dimensions to consider. First, what services are needed for any given feature? I talk about an “activation graph” of services. That is, given a particular request, what services must be involved to fully deliver that request? The more overlap between the activation graphs of all your request types, the less resilient your architecture is. This is why I think that so-called “entity services” are a bad idea; they tend to produce activation graphs with huge overlap.
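One rough way to make the idea of overlapping activation graphs concrete is to model each request type as the set of services it needs and then compare the sets. The request types, the service names, and the choice of the Jaccard index as an overlap measure are all assumptions made for this sketch; they are not taken from the interview.

```python
from itertools import combinations

# Hypothetical request types mapped to the services each one must activate.
ACTIVATION_GRAPHS = {
    "browse_catalog": {"catalog", "pricing", "customer_entity"},
    "checkout": {"cart", "payment", "customer_entity"},
    "view_profile": {"profile", "customer_entity"},
}


def overlap(a: set, b: set) -> float:
    """Jaccard index: shared services divided by all services used by either request."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_by_all(graphs: dict) -> set:
    """Services that every request type depends on; a failure here touches everything."""
    return set.intersection(*graphs.values()) if graphs else set()


for (name_a, graph_a), (name_b, graph_b) in combinations(ACTIVATION_GRAPHS.items(), 2):
    print(f"{name_a} / {name_b}: overlap {overlap(graph_a, graph_b):.2f}")
print("shared by all request types:", shared_by_all(ACTIVATION_GRAPHS))
```

In this invented example the `customer_entity` service appears in every activation graph, so its failure would affect every request type. That is the kind of overlap an entity service tends to create.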
The second dimension to consider is whether individual services have an easy way to handle failure in their dependencies. The caller must still respond to requests even when some of its dependencies are timing out or refusing connections. This means that resilience should be built into every service, preferably with a common framework so that monitoring and administration are simplified.
The Circuit Breaker Pattern from “Release It” is now a major principle of Resilient Design. Netflix’s Hystrix is one of the most prominent implementations of it. What is the essence of this pattern?
In your house, a short circuit produces high electrical current that can lead to a fire. To prevent that, electricians use a circuit breaker. It detects a dangerous condition (excess current) and intervenes to prevent the catastrophe. The software circuit breaker performs the same function: it prevents a partial failure from becoming a catastrophic outage.
In implementation terms, all calls to an external interface go through a component that can watch for failures in requests to the provider. If too many calls fail, it means the provider is unavailable. Whether that is due to the provider or the network being down is immaterial. From the caller’s perspective, all that matters is that the service cannot be reached.
Without a circuit breaker, every request on every thread would need to “discover” anew that the provider is unavailable, usually by hitting a timeout. This means that a failure in the provider causes the consumer to slow down. In the worst case, there is no timeout and all the caller’s threads end up waiting for responses that will never arrive.
In contrast, a circuit breaker would detect the first few failed calls and flip into a state where outbound requests are quickly refused without even attempting to call the provider. This means error responses are delivered quickly rather than incurring timeouts. The caller’s threads are thus preserved and available for other requests.
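The behavior described here can be sketched as a small wrapper around outbound calls. This is a minimal, single-threaded illustration and not the implementation from “Release It” or Hystrix. The class name, the threshold of three consecutive failures, and the rule of allowing a trial call after `reset_timeout` seconds are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused without contacting the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so timing can be tested
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: refuse quickly instead of waiting for another timeout.
                raise CircuitOpenError("provider marked unavailable")
            # Wait period elapsed: fall through and let a trial call reach the provider.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success: close the breaker and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would route every request to one provider through a shared instance, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a hypothetical function that should already enforce its own timeout. A production version would also need locking, so that concurrent threads do not all send trial calls at once, and would typically count failures over a time window rather than consecutively.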
Like the electrical circuit breaker, the software breaker interrupts service. It preserves the caller’s availability at the expense of completeness. That means the caller must have some reasonable action to take. Maybe it can deliver a cached response, a partial page, or some other form of degraded service. This requires careful design (especially with microservices) to ensure that a meaningful response exists at all.
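A degraded response of this kind might look like the following sketch. Everything in it is hypothetical: the `ProviderUnavailable` exception stands in for whatever a refused call raises, the in-memory dictionary stands in for a real cache, and the `degraded` flag is just one possible way to tell the page renderer that a section is stale or missing.

```python
class ProviderUnavailable(Exception):
    """Stands in for a refused call, a timeout, or an open circuit breaker."""


_last_good = {}  # hypothetical in-memory cache of the last successful responses


def fetch_with_fallback(key, fetch):
    """Return fresh data if possible, else the last good value, else an empty partial result."""
    try:
        value = fetch(key)
    except ProviderUnavailable:
        if key in _last_good:
            # Cached response: stale, but still useful to the user.
            return {"data": _last_good[key], "degraded": True}
        # Nothing cached: the page can omit this section and render the rest.
        return {"data": None, "degraded": True}
    _last_good[key] = value  # remember the fresh value for later failures
    return {"data": value, "degraded": False}
```

Whether a stale or empty result is acceptable depends on the feature: product recommendations can usually be dropped from a page, whereas a payment confirmation cannot.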