What does “full stack” monitoring mean for container environments?
We all know what “full stack” means. But what about when we’re talking about monitoring container environments? Things get a little fuzzy. In this article, Peter Arijs explores the different aspects and challenges of obtaining full stack visibility in such an environment with many moving parts.
What does “full stack” mean anyway?
The term full stack developer was coined in the early 2010s to identify developers with a broad set of skills throughout the application stack. This includes a combination of front-end and back-end application components, and even extends to the infrastructure layer with the emergence of “infrastructure as code”. The ongoing trend toward containerized applications, built from many different application components or microservices, has added to the complexity of the modern application stack, to the point that some have criticized the term full stack developer.
While it is probably unrealistic for a single person to know the development details of every part of the application (unless it is very simple), it is typically desirable to have visibility into all layers of the stack while the application is running in production. This allows you to identify issues quickly within the appropriate part of the application or infrastructure and act accordingly. So in this context, let’s explore what the term “full stack” visibility or monitoring means for a containerized application. For example, what does the stack typically look like? What are the relevant metrics at the different layers of the stack? And what functionality is required to collect and analyze all these metrics?
What does a container stack look like?
I frequently use the picture below in my presentations to illustrate the most important layers in the stack of a containerized application, and to discuss some important differences from traditional monolithic applications. Indeed, with the use of containers, and typically also an orchestration platform, extra layers of abstraction are introduced. It now becomes important to collect metrics from all these layers and tie them together in order to fully understand how a containerized application behaves.
What metrics to collect?
Looking at the picture above, in order to get full stack visibility of our application, we need to collect performance metrics from the following layers:
- At the infrastructure layer, we want to collect resource metrics such as CPU, memory, disk, and network. These could come from physical servers, virtual servers, or cloud instances. In the latter case, the metrics can typically be accessed via an API (e.g. Amazon CloudWatch), along with other metrics from the services we are using on the cloud platform.
- Typically an orchestrator is used to help with the deployment, scaling, and management of containers on the infrastructure. Kubernetes (or flavors thereof such as Red Hat OpenShift) and Docker Swarm are some of the most popular technologies. At this layer, we want to understand container counts and container dynamics such as scaling events. From the orchestrator, we can also gather service definitions and learn how containers are tied to services. This allows us to report at the service level, for example on the number of containers or other relevant metrics for a particular service.
- For the containers themselves, we also want to understand resource metrics, both per container and per service, as well as container life cycle events. In addition, we want to understand how the applications inside our containers are behaving. This so-called in-container monitoring provides us with application-specific metrics for the different services running inside the containers. For further details, read more about in-container monitoring.
- Finally, we want to see the impact on our end users and understand the performance they are getting as consumers of the application. This typically includes front-end metrics such as page load times and errors. Sometimes even business metrics can be added to “monitor what really matters”.
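To make the idea of tying container metrics to services more concrete, here is a minimal Python sketch of a service-level rollup. It is an illustration only: the `ContainerSample` shape, its field names, and the `summarize_by_service` helper are assumptions made for this example and do not correspond to the API of any particular monitoring tool or orchestrator.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ContainerSample:
    """One resource reading for a single container (hypothetical shape)."""
    container_id: str
    service: str        # service name, as it would be reported by the orchestrator
    cpu_percent: float
    memory_mb: float


def summarize_by_service(samples):
    """Roll per-container samples up into one summary per service."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.service].append(sample)

    summary = {}
    for service, members in grouped.items():
        summary[service] = {
            # Count distinct containers, in case one container reports twice.
            "containers": len({m.container_id for m in members}),
            "avg_cpu_percent": mean(m.cpu_percent for m in members),
            "total_memory_mb": sum(m.memory_mb for m in members),
        }
    return summary


if __name__ == "__main__":
    readings = [
        ContainerSample("c1", "web", 20.0, 100.0),
        ContainerSample("c2", "web", 40.0, 150.0),
        ContainerSample("c3", "redis", 5.0, 64.0),
    ]
    for name, stats in summarize_by_service(readings).items():
        print(name, stats)
```

In a real setup, the service name on each sample would come from the orchestrator's service definitions rather than being hard-coded, which is what makes reporting at the service level possible even as individual containers come and go.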
Gathering the different metrics from these layers is already a challenge on its own. Most monitoring tools focus on only a subset of these layers, because they were developed for traditional monolithic applications. Modern container monitoring tools should integrate with all the layers mentioned above to offer the complete picture and prevent blind spots.
But it doesn’t stop at just metrics collection. There are some other important considerations related to the way metrics and events are collected.
- Automatic instrumentation: Given the ephemeral nature of containers, it is crucial that new containers are automatically monitored when they are started. This includes recognizing that a new container has been started, identifying the services running inside it, and determining how these should be monitored. At CoScale, for example, we use a rich library of plugins to monitor application-specific metrics from known services such as NGINX, Redis, MongoDB and many others.
- Similarly, when new nodes are added to a cluster, it is important that they are equipped and configured with the right monitoring agent and settings, so that your monitoring can scale with your environment. This can be done using DaemonSets in Kubernetes or global services in Docker Swarm.
- Another prime consideration is where the monitoring agents are running and the overhead they generate. This is especially relevant since containers are lightweight and immutable constructs that should be impacted as little as possible. Some monitoring tools require an agent to be added to your container image or run as a sidecar container, which often adds significant overhead. Other tools such as CoScale only require a single agent per node (often running in its own container), adding minimal overhead.
- Gathering data is one thing, but making sense of it is a different story. To get the right insights, you need the right type of visualizations for container environments. A jam-packed dashboard with line charts of all resource metrics of all containers is not very insightful. You typically want to start with a high-level view of the health of your services and your cluster, and then be able to drill down when issues occur.
- Detecting the issues themselves can also be challenging. The number of containers and services, and the number of metrics that each of them generates, already results in a data deluge. Combine this with the dynamic aspect of containers, and you can see why classical alerting techniques often fall short. Self-learning analytic techniques such as dynamic baselining and anomaly detection are therefore very valuable in such environments, and help with the proactive detection of issues.
- Finally, besides detecting issues, fixing them should also be facilitated. For this, the right amount of contextual information needs to be gathered for troubleshooting. This includes correlating the issue with other events that occurred at the same time. For example, were all containers of a specific service impacted, or just one? And were there also issues with the downstream services? More detailed log data or tracing information can also help to troubleshoot the problematic services.
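As an illustration of the dynamic baselining mentioned above, the following Python sketch flags a metric value as anomalous when it deviates too far from a rolling baseline of recent values. This is a deliberately simple technique (rolling mean and standard deviation) chosen for the example, not a description of how CoScale or any other product detects anomalies. The `RollingBaseline` class and its `window`, `threshold`, and `min_samples` parameters are hypothetical names.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flag values that deviate strongly from a rolling baseline (illustrative sketch)."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        if min_samples < 2:
            raise ValueError("min_samples must be at least 2 to compute a spread")
        self.values = deque(maxlen=window)  # only the most recent `window` values
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.min_samples = min_samples      # history required before judging values

    def observe(self, value):
        """Return True if `value` looks anomalous, then add it to the history."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            baseline = mean(self.values)
            spread = stdev(self.values)
            if spread > 0:
                anomalous = abs(value - baseline) > self.threshold * spread
            else:
                # Perfectly flat history: treat any change as anomalous.
                anomalous = value != baseline
        self.values.append(value)
        return anomalous


if __name__ == "__main__":
    detector = RollingBaseline(window=10)
    for latency_ms in [100, 102, 98, 101, 99, 100, 150]:
        print(latency_ms, detector.observe(latency_ms))
```

Unlike a fixed alert threshold, the baseline here moves with the metric, so a service that is normally slower or busier is judged against its own recent behavior. Note that this sketch also adds anomalous values to the history, which keeps it simple but lets a sustained shift become the new baseline over time. Whether that is desirable depends on the metric.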
Full stack monitoring for container environments is a different beast than monolithic application monitoring. Classic monitoring tools often cannot provide the right insight into all the different layers, and have a hard time dealing with the scale and dynamics of container environments. Whether you plan to use an open source solution or a commercial offering, the considerations above can help you select the right tool to ensure complete visibility into your environment.