The three pillars of observability, explained

Demystifying observability

Cory Watson

Let’s clear the mists from observability. Observability is finding out why something is broken, determining the impact of that break, and assessing if the changes we think will fix the problem actually will and do fix that problem. In this article, Cory Watson discusses the three pillars of observability, where to start, and what it can do to change not only your IT incidents but also your work culture.

As the way companies build and operate systems continues to increase in complexity, traditional monitoring and logging are no longer enough for effective operational insight and solid decision-making. Add to that complexity the supersonic speed of change and the rapidly evolving nature of incidents, and IT personnel find themselves inside a perfect storm every day. 

Observability is the light that can help guide IT out of that storm. Before we start following the light, let’s define it. At its most basic definition, observability is about finding out why something is broken, determining the impact of that break, and assessing if the changes we think will fix the problem actually will and do fix that problem. Using techniques and tools to verify behavior, generate hypotheses, and validate fixes, we then gain confidence in the state of our systems. It’s this hypothesis generation, exploration, and validation that pushes observability beyond simple monitoring.

Cindy Sridharan summed up the difference between monitoring and observability perfectly in a blog post on Medium, entitled Monitoring and Observability:

Monitoring, as such, is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes as well as blackbox tests. “Observability”, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.

So, why is observability so important? With the right tools and techniques, we can gain deep insight into the behavior of a system—and it is exactly this knowledge that makes us more prepared for the future, for the unknown. While we don’t know what the next incident will be or when it will occur, having that deep understanding of the system enables us to handle it more effectively and make better decisions when it does occur.

This last point is so important that I am going to double down on it: when, not if, an incident occurs. While we all like to think our systems are impenetrable and highly available, the truth is that unknown failure is not only likely, it’s inevitable. Period. And when failure does occur, wouldn’t you rather be prepared for that scenario, so you can fix the problem quickly and return the system to an operational state? That’s the benefit of having observable systems and of using observability tools. In turn, observable systems (those in which operators can more easily understand and diagnose failure) improve customer experience through reduced downtime and minimized incident impact.

The three pillars of observability

I like to describe the observability tool chest as a three-legged stool, where each leg is as important as the other in keeping the stool upright. In other words, the three legs work together to provide the information we need.

The “pedestrian” piece of observability, logging, can often be full of unimportant information that you need to sift through in order to get to the gems. Structuring your logs from the get-go helps ensure that when you need certain information, you can find it quickly—and it can be ingested and analyzed by other observability tools.
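
To make "structuring your logs" concrete, here is a minimal sketch using Python's standard logging module. It assumes one JSON object per log line; the logger name, the field names, and the convention of passing fields through a "context" key are hypothetical choices for illustration, not something the article prescribes.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"context": {...}} and
        # those key/value pairs are merged into the JSON entry.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(stream):
    """Create a logger that writes structured entries to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "payment declined",
    extra={"context": {"order_id": "A-1001", "reason": "card_expired"}},
)

# Because every entry is machine-readable, finding a field is a lookup
# rather than a text search.
parsed = json.loads(buffer.getvalue())
print(parsed["order_id"], parsed["reason"])  # prints: A-1001 card_expired
```

Because each entry is a flat set of named fields, other tools can filter on something like order_id directly instead of pattern-matching free text.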

Next up, metrics, the “science” piece of observability that includes precise, meaningful, and valuable measurements of things you know you want to know (e.g., response time). While metrics are easy to visualize, it is also easy to lose their context or to collect too many of them, so it’s important to be thoughtful and intentional when deciding which metrics to track.
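
As an illustration of tracking one deliberately chosen metric, the sketch below records response times and summarizes them with percentiles. The class name, the metric name, and the sample values are all made up for the example, and the nearest-rank percentile is just one simple way to summarize; real metrics systems typically aggregate differently.

```python
import math
import time
from contextlib import contextmanager


class ResponseTimeMetric:
    """Collects response-time samples in milliseconds (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.samples_ms = []

    def record(self, duration_ms):
        self.samples_ms.append(duration_ms)

    @contextmanager
    def timed(self):
        """Measure the wrapped block and record its duration."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record((time.perf_counter() - start) * 1000.0)

    def percentile(self, p):
        """Nearest-rank percentile of the recorded samples."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


metric = ResponseTimeMetric("checkout.response_time_ms")  # hypothetical name
for ms in [12, 15, 11, 240, 14, 13, 16, 12, 18, 15]:  # made-up samples
    metric.record(ms)

summary = {
    "count": len(metric.samples_ms),
    "p50": metric.percentile(50),
    "p95": metric.percentile(95),
}
print(summary)  # prints: {'count': 10, 'p50': 14, 'p95': 240}
```

The gap between the median and the 95th percentile here shows why a single average can hide the one slow request that a customer actually noticed.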

And finally, we come to the “magic” piece of observability that illuminates what is really happening in your code: traces and spans. While a trace is a group of events that happen over time, each of those individual events is a span. And each span contains specific values: what it is (name), when it happened (time), where it came from (source), and, most importantly, why it happened—the cause. This is where the rubber meets the road and you can start to debug or fix the problem.
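
The relationship between a trace and its spans can be sketched as a small data structure. This is an assumption-laden illustration: the field names are invented, and representing the "cause" as a reference to a parent span is a common convention in tracing systems rather than something this article specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    name: str       # what it is
    start_ms: int   # when it happened
    source: str     # where it came from
    parent_id: Optional[str] = None  # why it happened (assumed: the causing span)


def causal_chain(trace: List[Span], span_id: str) -> List[str]:
    """Follow parent links from a span back to the root, returned root-first."""
    by_id = {span.span_id: span for span in trace}
    chain = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


# A trace is simply the group of spans for one request (made-up values).
trace = [
    Span("a", "GET /checkout", 0, "api-gateway"),
    Span("b", "charge_card", 20, "payments", parent_id="a"),
    Span("c", "SELECT card_token", 25, "payments-db", parent_id="b"),
]

print(" -> ".join(causal_chain(trace, "c")))
# prints: GET /checkout -> charge_card -> SELECT card_token
```

Walking the chain this way answers the "why" question: the database query happened because the card was charged, which happened because of the checkout request.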

Where to start?

If you’ve read this far, you’re likely saying to yourself, “Now I know how to get all this observability data, but once I have it, what do I do with it? How do I make use of it and get value from it?” I won’t lie—it can be a bit daunting to suddenly have thousands of metrics and tons of tracing information. But there are plenty of open source tools to get you started and, if you have the budget for them, some really rich observability tools, like the one from SignalFx, that can really up your observability game.

One thought I want to leave you with is that observability isn’t just an IT thing; it can become a culture. This cultural adoption means leveraging observability and reinforcing the mental models it affords wherever possible. For example, you can use observability data to illustrate a system function for new hires. You can also leverage investigative techniques when verifying new behavior, like a feature change. When you look at observability as a company-wide investment, rather than an IT-specific one, its value is virtually impossible to ignore.

Author

Cory Watson

Cory Watson is Director of Technology, Office of the CTO at SignalFx, leading high impact, customer-focused projects around observability and monitoring. Cory started his journey to observability as an SRE at Twitter, and continued on to found the observability team at Stripe. He is a strong voice in the observability community, through OSS, popular tweets, blog posts, and speaking engagements. 

Cory has over 20 years of software engineering experience and is an active founder and contributor of several successful open source projects. Before finding his passion in observability, he worked in several industries, including e-commerce, consulting, healthcare, and fintech.

