You can’t see a black swan without AIOps
A ‘black swan’ is a random, unexpected event with monumental impact, either good or bad. In the IT world, a black swan can lead to deep reflection—but also to broad generalizations based on incomplete data. AIOps can help and Ravi Lachhman explains how.
When someone says “black swan,” the first thing that may come to mind is the Natalie Portman movie of the same name. But what pops into my head is The Black Swan: The Impact of the Highly Improbable, author Nassim Nicholas Taleb’s definitive take on black swan theory. As Taleb describes it, a black swan event is highly improbable: say, Google’s astounding success or a sudden natural disaster. It has three primary characteristics:
- It’s unpredictable
- Its impact is massive
- When it’s over, we devise an explanation that makes the black swan seem less random and more predictable
The term black swan illustrates the shortcomings of inductive reasoning, the risks of focusing on the things we already know and the failure to consider those we don’t. If, for instance, you see a thousand white swans, you may conclude that all swans are white. However, the fact that you’ve never seen black swans doesn’t mean they don’t exist (they do). Taleb uses black swan as a metaphor for how we deal with unforeseeable events. After a black swan event, humans usually exert considerable effort to determine how they could have predicted the incident. A better response, says Taleb, is to focus on the lessons learned and ways to reduce the potentially catastrophic impact of a future black swan.
Black swan and system design
With IT infrastructure growing more distributed and complex, the awareness and coverage of enterprise systems can be muddied by a variety of factors, including multiple cloud providers for various applications and services; the immense volume of data produced by application infrastructure; and the fog of development, the uncertainty that arises when developers work on a small piece of a system or platform. And while we use disciplines such as chaos engineering (intentionally injecting failure into a system to gauge its resiliency) to make our platforms more robust, a black swan can still appear from out of nowhere.
On the plus side, a black swan can actually improve system design. Say, for instance, your primary data center catches fire, resulting in a total, catastrophic loss. Awful, yes, but there’s a bright spot: Several cloud-hosted SaaS components that integrate with on-premises applications are still available. You’ve learned a painful lesson: move more of your critical enterprise applications to the public cloud.
There are, of course, major hurdles in designing systems for the unpredictable, and black swan events impact the science of site reliability engineering (SRE). In her keynote at SREcon19 Americas, author, computer scientist and SRE Laura Nolan explored the taxonomy of a black swan event. Our systems tend to follow certain patterns, Nolan explained, and those patterns can produce cascading failures, exceeded capacity limits, and even non-obvious dependencies at the system level. A fundamental tenet of system design is to eliminate the single point of failure. Since we build redundancy into our platforms, ensuring that multiple, redundant components work effectively together is a key part of chaos engineering. A prime example is Netflix’s Chaos Monkey, a tool that randomly terminates virtual machines and/or containers in production, pushing engineers to implement services that are resilient to instance failures.
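To make the single-point-of-failure idea concrete, here is a minimal Python sketch of random instance termination. It is a toy simulation, not Chaos Monkey’s actual interface: the instance names, the function names and the "at least one healthy instance" rule are all assumptions made for illustration.

```python
import random

def terminate_random_instances(instances, kill_count, rng):
    """Return the instances left after terminating `kill_count` of them at random."""
    victims = set(rng.sample(instances, min(kill_count, len(instances))))
    return [name for name in instances if name not in victims]

def service_is_up(survivors, minimum_healthy=1):
    """Hypothetical health rule: the service is up if enough instances survive."""
    return len(survivors) >= minimum_healthy

rng = random.Random(42)  # seeded so the experiment is repeatable

single = ["web-0"]                       # a single point of failure
redundant = ["web-0", "web-1", "web-2"]  # three redundant instances

single_survivors = terminate_random_instances(single, 1, rng)
redundant_survivors = terminate_random_instances(redundant, 1, rng)

print("single instance service up:", service_is_up(single_survivors))
print("redundant service up:", service_is_up(redundant_survivors))
```

Losing one instance takes down the single-instance service, while the redundant one keeps running on its two survivors. A real chaos experiment would terminate live infrastructure and watch real health checks rather than lists of names.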
Modern workloads and black swans
As our workload and deployment topologies grow more distributed, we may inadvertently introduce dangers we cannot predict: in essence, a black swan. The container boom has led to a rise in container orchestrators and resource managers. And popular open-source orchestrators such as Kubernetes and Mesos allow us to rapidly deploy robust services across disparate environments. In essence, we’re pushing the boundaries of platforms and applications, introducing unknowns that may impact the entire system, such as a security vulnerability that impacts low-level components and leads to an outage. But by correlating disparate events, we can identify the unknown.
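What might correlating disparate events look like? One simple approach, sketched below in Python, is to group events that occur close together in time and then look for groups that span more than one source. The event sources, messages and the 30-second window are invented for illustration; real correlation engines use far richer signals than timestamps alone.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary starting point
    source: str       # hypothetical emitting component
    message: str

def correlate(events, window_seconds=30.0):
    """Group events so that each one falls within the window of the previous one."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups

def cross_source_groups(groups):
    """Keep only the groups that involve more than one source."""
    return [g for g in groups if len({e.source for e in g}) > 1]

events = [
    Event(0.0, "kernel", "security patch applied on node-a"),
    Event(12.0, "orchestrator", "pods restarting on node-a"),
    Event(20.0, "load-balancer", "5xx error rate rising"),
    Event(600.0, "orchestrator", "routine deployment finished"),
]

groups = correlate(events)
suspicious = cross_source_groups(groups)

for group in suspicious:
    print([f"{e.source}: {e.message}" for e in group])
```

Here the low-level patch, the pod restarts and the rising error rate land in one group, while the routine deployment ten minutes later stands alone. Time proximity suggests a relationship but does not prove one, so a group like this is a starting point for investigation rather than a diagnosis.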
AIOps and black swans
Circling back to Nassim Nicholas Taleb’s black swan definition, our perceptions and biases can easily sway our interpretation of events. By applying big data, modern machine learning and other advanced analytics technologies to enhance IT operations, both directly and indirectly, AIOps enables the system to react and correlate faster than a human operator. As our systems become more distributed and scale demands rise, we can become overwhelmed by data volume and velocity. Organizations of all sizes can benefit from the lens and conclusion-making capabilities of AIOps.
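As a very small taste of the analytics involved, the Python sketch below flags a metric reading that sits far outside a recent baseline using a z-score. This is a textbook statistical check, not a description of how any particular AIOps product works, and the latency values and the threshold of three standard deviations are assumptions chosen for illustration.

```python
import statistics

def z_score(value, baseline):
    """How many sample standard deviations `value` lies from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return (value - mean) / stdev

def is_anomalous(value, baseline, threshold=3.0):
    """Flag readings whose absolute z-score exceeds the threshold."""
    return abs(z_score(value, baseline)) > threshold

# Hypothetical request latencies in milliseconds from a quiet period.
baseline_latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomalous(101, baseline_latency_ms))  # an ordinary reading
print(is_anomalous(130, baseline_latency_ms))  # a reading far above the baseline
```

A check like this can only tell us that something looks unusual relative to what we have already seen. The harder work, and the promise of AIOps, lies in applying such signals across many metrics at once and tying them back to the events that caused them.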