Control and responsibility in deep systems – interview with Ben Sigelman
As deep systems become more prevalent in today’s high-speed digital enterprises, DevOps must find the best tools in order to manage them. We spoke with Ben Sigelman, co-founder and CEO at LightStep, about how deep systems emerge, microservice observability, and the difficult aspects of dealing with deep systems.
JAXenter: Could you tell us a little bit about what deep systems are and how they emerged?
Ben: Deep systems are essentially a symptom of any well-executed digital transformation at any large enterprise. To keep up with – and ahead of – the current demands of today’s digital enterprises, we’ve had to shift our strategy to speed up development processes. At the same time, developer teams have ballooned, in many cases up to 100+ developers for individual business units. As a result, our industry adopted microservices, which enabled large teams to organize into smaller, separately managed units that can operate autonomously from one another.
While microservices empower small teams to work independently of the larger development team, they’re still connected and codependent on other microservices. These, along with monoliths and managed cloud services, make up the independently-managed layers in the end-to-end application stack and determine the “depth” of an architecture. We consider systems to be “deep” when production architectures comprise four or more of these layers. And while deep stacks certainly aren’t new, it’s the independently managed layers of today’s distributed stacks that make deep systems a unique phenomenon – and a growing issue for developers managing large-scale applications.
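To make the "four or more layers" rule of thumb concrete, here is a minimal sketch that measures depth as the number of layers on the longest request path through a call graph. The service names and the graph itself are hypothetical, and the sketch assumes that every node is an independently managed layer and that the graph has no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: each service maps to the services it calls.
# Assumption: every entry is an independently managed layer.
CALLS = {
    "web-frontend": ["api-gateway"],
    "api-gateway": ["checkout", "search"],
    "checkout": ["payments"],
    "payments": ["managed-db"],
    "search": ["managed-db"],
    "managed-db": [],
}


def depth(calls, root):
    """Count the layers on the longest request path starting at root.

    Assumes the call graph is acyclic.
    """

    @lru_cache(maxsize=None)
    def longest(service):
        downstream = calls.get(service, [])
        # A leaf counts as one layer; otherwise add the deepest child path.
        return 1 + max((longest(s) for s in downstream), default=0)

    return longest(root)


def is_deep(calls, root, threshold=4):
    """Apply the interview's rule of thumb: four or more layers is deep."""
    return depth(calls, root) >= threshold


if __name__ == "__main__":
    print(depth(CALLS, "web-frontend"), is_deep(CALLS, "web-frontend"))
```

In this made-up graph the longest path from the frontend passes through five layers, so it would count as deep, while a request that starts at the search service touches only two.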
With deep systems, problems arise when requests propagate from one layer or team to the next. When this happens, conventional tooling and investigations break down. Developers are then forced to spend inordinate amounts of time trying to identify the cause of performance issues and unexplained regressions, wasting valuable time and resources – not to mention contributing to burnout and poor morale among teams.
JAXenter: When did you first come into contact with a deep system and how did you approach your work with it?
Ben: In 2003, I joined Google and stayed for nine years. During my time there I worked on and deployed Dapper across Google’s production systems. In today’s language, we would call Dapper an “observability” tool. It was then that I found web search traces and started expanding them, all the way from the top to the bottom. I discovered that there were more than *20 layers* of microservices from the ingress frontend that handled the query down through the depths of the serving system. The complexity was mesmerizing, but honestly somewhat terrifying, given that the tools available at the time weren’t capable of providing the kind of visibility developers needed to understand a system that deep.
I learned a lot deploying Dapper, and in many ways we were constrained by Google’s truly unique – if not downright ridiculous – scaling requirements. At Lightstep, what we’re doing is far more flexible, adaptable, and powerful, using the knowledge of what we did at Google to create something better.
JAXenter: With so many layers of microservices adding layer upon layer of complexity, how do you get an overview of the system and then what’s the next step if you achieve that?
Ben: To be frank, much of the advice around microservices observability is misguided. As an example, many consider “The Three Pillars of Observability” to be metrics, logs, and distributed traces, but those are actually just the raw input data that feed into an observability solution. If anything, I’d consider them “The Three Pillars of Telemetry” – at best.
Once we stop treating tracing as “a third pillar,” it can solve this problem, but not by sprinkling individual traces on top of metrics and logging products. Individual distributed traces are necessary, but they are rarely sufficient on their own, since each one represents only an isolated transaction. Tracing must form the backbone of unified observability, and that means more than manually inspecting individual traces: only the context found in trace aggregates can address the sprawling, many-layered complexity that deep systems introduce.
When it comes to observability strategy, development teams should NEVER consider metrics, logs, and traces as individual product features. Treating them as separate capabilities means developer teams will need to have three tabs open (one for each so-called “pillar”) during releases and investigations, which leads to clumsy context-switching and disorientation. The most effective and sensible way to build an observability strategy is around specific use cases.
To get an accurate overview of the system, it’s best to use portable, high-performance instrumentation (e.g. OpenTelemetry or OpenTracing/OpenCensus) to gather traces, logs, and metrics. Then design your observability strategy around use cases that actually matter in observability, like deploying new service versions, improving steady-state performance, and reducing MTTR.
JAXenter: What would you say is the most difficult aspect of dealing with deep systems?
Ben: Without a doubt, the hardest part of dealing with deep systems is the stress from the disconnect between what service owners can control, and what they are responsible for – two vastly different things. This image can help to illustrate this:
In microservices, DevOps teams control only their own service, yet the performance and functionality of the entire stack are still their responsibility. That’s why deep systems are immensely stressful for DevOps. The deeper the production architecture, the wider this chasm grows, and the more stressful it becomes.
JAXenter: In your opinion, will we see more and more deep systems in the future, and what do you think is the best way to deal with them?
Ben: In our industry, developers are instrumental to business success, and as monoliths come to an end, our reliance on developers will only increase. As times change, it’s important to ask if we are approaching managing our systems correctly, and if we’re empowering developers with the tools and knowledge they need to navigate this complexity.
So as deep systems become more prevalent, we need to find the right solutions to manage them. Let’s start by pinpointing the problem: the lack of effective communication across the multiple autonomously managed levels in one stack. The only type of telemetry that is similar to this multi-service, multi-layer dependency is tracing, so it has to be the backbone of observability in deep systems. For example, if layer one of an application relies on layer ten (with multiple layers likely in between), we need to build a model of the application from the perspective of layer one and take snapshots of the model before, during, and after layer ten’s hypothetical issue. With thousands of traces per snapshot, observability can then identify where the problem originated by comparing the snapshots taken before and during a faulty update.
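The snapshot comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Lightstep's actual method: each trace is reduced to hypothetical (service, self-time in milliseconds) pairs, where self-time excludes time spent waiting on downstream calls, and each snapshot is summarized by a per-service median.

```python
from collections import defaultdict
from statistics import median

# Hypothetical snapshots. Each trace is a list of (service, self_time_ms)
# pairs; self time excludes time spent waiting on downstream services.
BEFORE = [
    [("layer-1", 10), ("layer-5", 20), ("layer-10", 30)],
    [("layer-1", 12), ("layer-5", 22), ("layer-10", 28)],
]
DURING = [
    [("layer-1", 11), ("layer-5", 21), ("layer-10", 90)],
    [("layer-1", 10), ("layer-5", 23), ("layer-10", 110)],
]


def aggregate(traces):
    """Median self time per service across all traces in one snapshot."""
    durations = defaultdict(list)
    for trace in traces:
        for service, self_time_ms in trace:
            durations[service].append(self_time_ms)
    return {service: median(values) for service, values in durations.items()}


def largest_regression(before, during):
    """Return the service whose median self time grew the most."""
    base = aggregate(before)
    current = aggregate(during)
    # Only compare services that appear in both snapshots.
    deltas = {s: current[s] - base[s] for s in current if s in base}
    return max(deltas, key=deltas.get)


if __name__ == "__main__":
    print(largest_regression(BEFORE, DURING))
```

With this made-up data, the aggregate view points at layer ten even though the request entered at layer one, which is the kind of answer a single trace inspected by hand would not reliably give. A real system would work from thousands of traces per snapshot and would likely use richer statistics than a median.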
Thanks very much!