Let’s clear the mists from observability. Observability is finding out why something is broken, determining the impact of that break, and assessing whether the changes we believe will fix the problem actually do. In this article, Cory Watson discusses the three pillars of observability, where to start, and what it can do to change not only your IT incidents but also your work culture.
As the way companies build and operate systems continues to increase in complexity, traditional monitoring and logging are no longer enough for effective operational insight and solid decision-making. Add to that complexity the supersonic speed of change and the rapidly evolving nature of incidents, and IT personnel find themselves inside a perfect storm every day.
Observability is the light that can help guide IT out of that storm. Before we start following the light, let’s define it. At its most basic, observability is about finding out why something is broken, determining the impact of that break, and assessing whether the changes we believe will fix the problem actually do. Using techniques and tools to verify behavior, generate hypotheses, and validate fixes, we gain confidence in the state of our systems. It’s this hypothesis generation, exploration, and validation that pushes observability beyond simple monitoring.
Cindy Sridharan summed up the difference between monitoring and observability perfectly in a blog post on Medium, entitled Monitoring and Observability:
Monitoring, as such, is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes as well as blackbox tests. “Observability”, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.
So, why is observability so important? With the right tools and techniques, we can gain deep insight into the behavior of a system—and it is exactly this knowledge that makes us more prepared for the future, for the unknown. While we don’t know what the next incident will be or when it will occur, having that deep understanding of the system enables us to handle it more effectively and make better decisions when it does occur.
This last point is so important that I am going to double down on it: when, not if, an incident occurs. While we all like to think our systems are impenetrable and highly available, the truth is that unknown failure is not only likely, it’s inevitable. Period. And when failure does occur, wouldn’t you rather be prepared for that scenario, so you can fix the problem quickly and return the system to an operational state? That’s the benefit of having observable systems and of using observability tools. In turn, observable systems—those that are easier for operators to understand and diagnose failure in—improve customer experience through reduced downtime and minimized incident impact.
The three pillars of observability
I like to describe the observability tool chest as a three-legged stool, where each leg is as important as the others in keeping the stool upright. In other words, the three legs work together to provide the information we need.
The “pedestrian” piece of observability, logging, can often be full of unimportant information that you need to sift through in order to get to the gems. Structuring your logs from the get-go helps ensure that when you need certain information, you can find it quickly—and it can be ingested and analyzed by other observability tools.
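To make "structuring your logs" concrete, here is a minimal sketch of structured logging using only the Python standard library. The helper name `structured_event` and the field names (`user_id`, `duration_ms`) are illustrative, not part of any standard schema; a real system might use a library such as structlog instead.

```python
import json
import logging

# Hypothetical helper: build a single-line JSON payload for a log event so
# downstream tools can filter on named fields instead of grepping free text.
def structured_event(message, **fields):
    return json.dumps({"message": message, **fields}, sort_keys=True)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

# Every record is machine-parseable; the gems are now queryable fields.
logger.info(structured_event("order placed", user_id=42, duration_ms=117, status="ok"))
```

Because each line is valid JSON, a log pipeline can ingest it directly and index on fields like `status` without any regex gymnastics.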
Next up, metrics, the “science” piece of observability that includes precise, meaningful, and valuable measurements of things you know you want to know (e.g., response time). While metrics are easy to visualize, it is also easy to lose context of or have too many of them, so it’s important to be thoughtful and intentional when deciding which metrics to track.
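As a sketch of what recording a metric like response time looks like, here is a tiny in-process counter and timer. In practice you would use a metrics client (statsd, Prometheus, or a vendor SDK); the metric names below are made up for illustration.

```python
import time
from collections import defaultdict

# Minimal in-process metric stores: a counter and a list of timing samples.
counters = defaultdict(int)
timings = defaultdict(list)  # milliseconds per sample

def timed(metric):
    """Decorator that records how long the wrapped function takes."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                timings[metric].append(elapsed_ms)
        return inner
    return wrap

@timed("handler.response_time")
def handle_request():
    counters["requests.total"] += 1
    return "ok"

handle_request()
```

Note that each metric here is deliberately chosen (request count, response time), echoing the point above: track the measurements you know you need, rather than everything you can.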
And finally, we come to the “magic” piece of observability that illuminates what is really happening in your code: traces and spans. While a trace is a group of events that happen over time, each of those individual events is a span. And each span contains specific values: what it is (name), when it happened (time), where it came from (source), and, most importantly, why it happened—the cause. This is where the rubber meets the road and you can start to debug or fix the problem.
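The trace/span relationship can be sketched in a few lines: a trace is just a collection of spans, and each span carries the values described above (a name, start and end times, and a parent link that records where it came from). This is a toy model for illustration; real tracers such as OpenTelemetry manage IDs, context propagation, and export for you.

```python
import time
import uuid

class Span:
    """One event in a trace: what it is, when it happened, and its parent."""
    def __init__(self, trace, name, parent_id=None):
        self.trace = trace
        self.span_id = uuid.uuid4().hex
        self.parent_id = parent_id  # links child work back to its cause
        self.name = name
        self.start = time.time()
        self.end = None

    def finish(self):
        self.end = time.time()
        self.trace.append(self)

# A trace: a group of spans that happen over time.
trace = []
root = Span(trace, "handle_request")
child = Span(trace, "query_database", parent_id=root.span_id)
child.finish()
root.finish()
```

Walking the `parent_id` links from a slow span back to the root is exactly the "why did this happen" debugging step the pillar exists for.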
Where to start?
If you’ve read this far, you’re likely saying to yourself, “Now I know how to get all this observability data, but once I have it, what do I do with it? How do I make use of it and get value from it?” I won’t lie—it can be a bit daunting to suddenly have thousands of metrics and tons of tracing information. But there are plenty of open source tools to get you started and, if you have the budget for them, some really rich observability tools, like the one from SignalFx, that can really up your observability game.
One thought I want to leave you with is that observability isn’t just an IT thing; it can become a culture. This cultural adoption means leveraging observability and reinforcing the mental models it affords wherever possible. For example, you can use observability data to illustrate a system function for new hires. You can also leverage investigative techniques when verifying new behavior, like a feature change. When you look at observability as a company-wide investment, rather than an IT-specific one, its value is virtually impossible to ignore.