Making reliability easier

Observability and approach – what does good look like?

Ben Newton

From the starting point of system reliability or business objectives, we can make a better case around how observability data can improve software development over time. What are we doing today and how can we do better?

Observability has been a big topic in developer discussions over the past year or two, as more and more people feel they need insight into what their applications and infrastructure are doing. However, beyond the usual definition of observability as the combination of logs, metrics and traces, there is little consensus on how best to use this data across teams.

A big part of the problem here is that the initial questions around observability tend to focus on defining what it is and what it is not. However, this does not answer the question of what observability is ultimately for. From the starting point of system reliability or business objectives, we can make a better case around how observability data can improve software development over time.

Taking this approach, we can reverse engineer what we want to achieve as teams, and then look at how to plug data in to support those goals. Alongside this, we can look at how to make it easier to keep the systems we support reliable.

What are we doing today?

Most developers today are creating and supporting more complex applications than they would have done in the past. This is due to a range of environmental factors – we have cloud providers that can offer more resources than we can ever usefully consume with a single application instance; we have more options open to us around where to run the applications we create; we can use design approaches like microservices to try and simplify how we expand and evolve software over time; and we can use options like serverless to get rid of the problem of hardware or infrastructure completely if we choose.

Whatever approach we choose to take, these applications will produce more and more data that can be used for observation and for reliability. Google’s work around Site Reliability Engineering (SRE) has led the way in defining approaches to running large scale applications and sites.

By focusing on service level indicators (SLIs) and service level objectives (SLOs), teams can design services that meet business objectives and keep up with the huge amounts of growth that online services both require and cause. Monitoring has a role to play in SRE, providing insight into what is taking place across applications and letting teams know what steps are needed, if any.
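
As a rough illustration of how a service level indicator can be checked against an objective, here is a minimal Python sketch. It assumes an availability indicator defined as the share of successful requests and a hypothetical 99.9% objective; the function name, the field names and the figures are illustrative assumptions, not something taken from Google's SRE material or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo: float               # target fraction of good requests
    budget_remaining: float  # fraction of the error budget still unspent
    met: bool                # True when the measured SLI meets the target


def availability_report(good: int, total: int, slo: float = 0.999) -> SloReport:
    """Compare a measured availability SLI against an SLO target.

    `good`, `total` and the 0.999 default are hypothetical inputs
    chosen for illustration only.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1 - slo) * total   # failures the SLO tolerates
    actual_bad = total - good         # failures actually observed
    budget_remaining = 1 - actual_bad / allowed_bad
    return SloReport(sli, slo, budget_remaining, sli >= slo)


if __name__ == "__main__":
    # One million requests, 500 of which failed.
    print(availability_report(good=999_500, total=1_000_000))
```

A negative `budget_remaining` means the service has failed more often than the objective allows, which is the kind of signal a team might use to decide whether reliability work should take priority over new features.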

Observability has a role to play within SRE projects, but SRE can also help inform our thinking. Looking into the flow of information that comes through can tell us more about what is taking place, and this data can be used for troubleshooting and problem-solving. However, what observability can’t do on its own is define our goals. We have to do that based on internal goals like reliability – is my application available and performing well? Alternatively, we might use business goals – is my site serving customers well, for example – and then map these back to things that we as developers can control.

Starting from this initial definition helps to set a benchmark for what “good” looks like, either in reality or in principle. Setting this out at the start gives the team an objective to measure against or work towards. For a business goal, this will involve translating a business objective into something that can be measured in technical terms. For example, an eCommerce site can capture many data points, from time spent on the site through to shopping cart abandonment rates and marketing campaign responses.

Choosing the right metrics to track from an observability point of view should involve looking at areas such as how application components perform against demand from customers.

For example, we might look at application loads and response times to see how the service performs when 1,000 customers are active compared to 10,000 or 100,000 concurrent users. Ideally, our applications should scale up and support those different load levels without strain.
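
One way to make that comparison concrete is to summarise response times per load level and flag the levels where latency exceeds a target. The Python sketch below assumes response-time samples in milliseconds have already been collected for each level of concurrent users; the nearest-rank 95th percentile, the function names and the 300 ms threshold are illustrative assumptions rather than a recommended standard.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_load(latencies_ms: Dict[int, List[float]]) -> Dict[int, float]:
    """Map each concurrent-user level to its 95th percentile latency."""
    return {
        users: percentile(samples, 95)
        for users, samples in sorted(latencies_ms.items())
    }


def degraded_levels(p95: Dict[int, float], threshold_ms: float) -> List[int]:
    """Return the load levels whose p95 latency exceeds the threshold."""
    return [users for users, value in p95.items() if value > threshold_ms]


if __name__ == "__main__":
    # Hypothetical samples for two load levels.
    samples = {
        1_000: [100.0] * 19 + [200.0],
        10_000: [150.0] * 18 + [400.0, 500.0],
    }
    summary = p95_by_load(samples)
    print(summary)
    print(degraded_levels(summary, threshold_ms=300.0))
```

A percentile is used here instead of an average because a small number of slow requests can hide behind a healthy-looking mean.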

How can we do better?

For teams involved in observability, getting data that actually helps them should come first. Focusing on a specific requirement like reliability improvement supports that goal, as it puts the emphasis on specific and achievable actions that can make application development and management easier, and ultimately enhance the user experience.

By using data in this way, we should be able to see where work is needed, where there are design assumptions that we should change, and where our work can make the most difference.

Ben Newton
Ben is a veteran of the IT Operations market, with a two-decade career across large and small companies such as Loudcloud, BladeLogic, Northrop Grumman, EDS, and BMC. Ben got to do DevOps before DevOps was cool, working with government agencies and major commercial brands to be more agile and move faster. More recently, Ben spent five years in product management at Sumo Logic, where he now runs product marketing for Operations Analytics.
