What does container-native monitoring really mean?
Containers! We all love them. But what does it mean when we talk about container-native monitoring? Peter Arijs explains more about how you can successfully keep track of your containers.
In a previous article, we wrote about how “container-native” monitoring could help your DevOps initiatives succeed. Terms such as cloud-native and container-native are thrown around frequently these days. But what do they actually mean? In this article, we describe what we mean by container-native monitoring.
At its heart, container-native monitoring means monitoring that is adapted for dynamic container environments and addresses the specific challenges of full-stack visibility in such environments. That’s still pretty vague, so let’s go into some aspects in a bit more detail.
Individual containers don’t matter unless they do
In a cloud context, the metaphor “pets vs. cattle” is often used: traditional servers behave as pets with names that you keep and care about, while in the cloud you are dealing with dynamic instances that are easily replaceable and behave as cattle. Containers are really an extension of that. They come and go (and that is a good thing!) as services are deployed, updated, or scaled out.
But just as with cattle, it’s not the individual animals that matter but the herd and the purpose it serves. In a container context, this purpose is the “services”. So container-native monitoring (as strange as it may sound) should not focus so much on the individual containers, but first and foremost on the services they offer. When a service is having issues, you want to be notified automatically about it, and at that point, you want to have the ability to drill down to the container level if needed.
Of course, you also want to be able to quickly identify problematic containers. Let’s say there are 10 containers backing a service, and one of them is handling requests at twice the latency of the other containers in the service. You want to be notified about this deviating behavior and take a detailed look at that container and its surroundings. For example, there could be another container on the same node that is saturating the disk I/O and causing this slowdown.
The way we handle this at CoScale is by collecting individual container resource statistics and relating them to the service-level information we get from the orchestrator platform. We then provide specific visualizations of the performance of individual services and the containers for each service, using a topology dashboard such as the one below, where you can drill down into the individual containers. We also highlight problematic containers and automatically notify you of abnormal behavior. For this, we use anomaly detection techniques, both at the service level (taking seasonality into account) and at the container level (comparing containers of the same service).
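To make the container-level comparison concrete, here is a minimal sketch of the idea of comparing containers of the same service. It is not CoScale's actual algorithm: the function name, the 1.5x threshold, and the sample data are assumptions chosen for illustration. It flags any container whose latency is well above the median of its peers.

```python
from statistics import median


def find_outlier_containers(latencies_ms, ratio_threshold=1.5):
    """Return the containers whose latency is at least `ratio_threshold`
    times the median latency of the other containers in the same service.

    `latencies_ms` maps a container name to its average request latency.
    The threshold is an illustrative assumption, not a recommended value.
    """
    outliers = []
    for name, latency in latencies_ms.items():
        peers = [v for other, v in latencies_ms.items() if other != name]
        if not peers:
            continue  # a single container has no peers to compare against
        baseline = median(peers)
        if baseline > 0 and latency / baseline >= ratio_threshold:
            outliers.append(name)
    return sorted(outliers)


# Ten containers back the service; one of them is twice as slow as the rest.
service_latencies = {f"web-{i}": 100.0 for i in range(9)}
service_latencies["web-9"] = 200.0

print(find_outlier_containers(service_latencies))  # ['web-9']
```

Using the median of the peers rather than the mean keeps a single slow container from shifting the baseline it is compared against.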
Containers only expose what needs to be exposed
The way containers operate poses specific challenges for collecting metrics from them. There are various ways to deal with this. You could start exposing ports, mounting volumes, etc. to expose information from a container to the outside world. This is not only cumbersome but also raises security issues. For example, when you expose a JMX connection to access the stats interface, potentially malicious actions could be triggered via JMX. So ideally you want to keep your JMX connection local to the container, which is what containers are made for.
Another alternative is to start packaging monitoring agents inside your containers. Besides the extra overhead, this also breaks the immutability of containers and is not compatible with limiting containers to a single process.
The way we deal with this at CoScale is by using a single monitoring agent per node that works with a set of plugins to monitor both container and orchestrator metrics, as well as the services inside each container. The agent will start the plugin within the namespace of the container to make sure that the plugin has the same view as the application running inside the container. This makes sure that you don’t have to expose anything, and this approach works out of the box.
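How the CoScale agent enters a container's namespace is not described here, so the following is only a sketch of the general technique. It assumes the util-linux `nsenter` tool is available on the node; the helper name, the PID, and the status URL are hypothetical. The function only builds the command and does not run it.

```python
ALLOWED_NAMESPACES = {"mount", "net", "pid", "uts", "ipc"}


def build_nsenter_command(container_pid, plugin_command, namespaces=("mount", "net")):
    """Build an `nsenter` command line that runs `plugin_command` inside
    the given namespaces of the process with PID `container_pid`.

    Assumption: the node provides util-linux `nsenter`. This is an
    illustration of the technique, not the CoScale implementation.
    """
    unknown = set(namespaces) - ALLOWED_NAMESPACES
    if unknown:
        raise ValueError(f"unsupported namespaces: {sorted(unknown)}")
    command = ["nsenter", "--target", str(container_pid)]
    command += [f"--{ns}" for ns in namespaces]
    return command + ["--"] + list(plugin_command)


# Hypothetical example: query a status endpoint that only listens on the
# container's own localhost, without publishing any port.
print(build_nsenter_command(
    4242,
    ["curl", "-s", "http://localhost:8080/status"],
    namespaces=("net",),
))
```

Because the command runs in the container's network namespace, `localhost` resolves to the container itself, so nothing has to be exposed to the outside world.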
Accessing container log files for data retrieval
Logs are often a good source of information to derive metrics from. There are multiple options for writing and storing log files in a container environment. As with monitoring, you don’t want to put an agent inside your containers to gather logs, or have any reference to your log aggregator inside the container. Logging should be handled by the platform.
The most effective way to get the log data from your container to the outside world is through /dev/stdout. The platform then picks up those logs and pushes them to the log aggregator. This makes log access and aggregation easy and direct, and it makes sure your containers rely on a single process and don’t require background workers or cron jobs to clean up their logs.
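On the application side, this usually amounts to sending log output to standard output instead of a file. A minimal sketch using Python's standard `logging` module follows; the logger name and the log format are assumptions chosen for illustration.

```python
import logging
import sys


def configure_stdout_logging(name="app", level=logging.INFO):
    """Return a logger that writes to stdout only, so that the container
    platform can collect the output. No log files are created, so there
    is nothing to rotate or clean up inside the container."""
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records away from other handlers
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


log = configure_stdout_logging()
log.info("service started")
```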
CoScale has support for extracting metrics and events from logs pushed to /dev/stdout and /dev/stderr. However, if your containers do have multiple log files (e.g. access logs, error logs, etc.) within the container, the CoScale plugins can also be configured to extract metrics and events from those log files.
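As an illustration of deriving metrics from log lines, the sketch below counts HTTP responses per status class. It is not the CoScale plugin format: the function name and the sample lines are made up, and the lines are assumed to follow the common access log layout in which the status code directly follows the quoted request.

```python
import re
from collections import Counter

# Assumption: the three-digit status code follows the quoted request,
# e.g. ... "GET / HTTP/1.1" 200 512
STATUS_PATTERN = re.compile(r'"\s(\d{3})\s')


def status_class_counts(lines):
    """Count responses per status class (2xx, 4xx, 5xx, ...).
    Lines that do not look like access log entries are ignored."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return dict(counts)


sample_lines = [
    '10.0.0.1 - - [01/Jan/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2018:10:00:01 +0000] "GET /api HTTP/1.1" 500 0',
    '10.0.0.3 - - [01/Jan/2018:10:00:02 +0000] "POST /api HTTP/1.1" 201 17',
    'not an access log line',
]

print(status_class_counts(sample_lines))  # {'2xx': 2, '5xx': 1}
```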
All containers are created equal, except they’re not
Containers use environment variables for initialization, connections, etc. Stateful containers such as database containers (e.g. PostgreSQL, MySQL) also use environment variables to initialize the database if it has not been initialized yet. These environment variables have to be taken into account to properly monitor these containers and the services that they are running, so your monitoring solution should understand this.
Some CoScale plugins require credentials to gather metrics for the running services. The environment variables provided to the container can be used in the CoScale plugin configuration. For example, if a PostgreSQL container takes PG_USER and PG_PASSWORD environment variables, you can use $PG_USER and $PG_PASSWORD in the CoScale Postgres plugin configuration as the credentials to connect to the database. When the CoScale agent detects a PostgreSQL container, it will know to use the environment variables provided to that container as the credentials to fetch the PostgreSQL statistics for that container.
This way images don’t have to be changed to include fixed credentials just for the purpose of monitoring them.
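The substitution step can be sketched as follows. This is an illustration only: the configuration keys and the `resolve_plugin_config` helper are hypothetical and do not reflect the actual CoScale plugin configuration format. Only the `$PG_USER` and `$PG_PASSWORD` placeholders come from the example above.

```python
from string import Template


def resolve_plugin_config(plugin_config, container_env):
    """Replace $NAME placeholders in the plugin configuration with the
    values of the environment variables of the monitored container.
    Raises KeyError if a referenced variable is not set on the container."""
    return {
        key: Template(value).substitute(container_env)
        for key, value in plugin_config.items()
    }


# Hypothetical plugin configuration; no credentials are hard-coded.
plugin_config = {
    "host": "localhost",
    "user": "$PG_USER",
    "password": "$PG_PASSWORD",
}

# Environment of the detected container (values are made up).
container_env = {"PG_USER": "monitor", "PG_PASSWORD": "s3cret"}

print(resolve_plugin_config(plugin_config, container_env))
```

Failing loudly on a missing variable is a deliberate choice in this sketch: connecting with an unresolved placeholder as the password would only produce a less obvious error later.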
Deploy your monitoring the same way you deploy your services
Since your services are running in containers, it makes sense to do the same with your monitoring agents. Some monitoring tools will require you to install an agent inside your container, or in a sidecar container, often resulting in extra overhead. In addition, having to package an extra monitoring agent in your containers is not something developers are very fond of, as it breaks the single purpose of containers. Deploying your monitoring agent in its own container is a more container-native solution that makes it just as easy to deploy your monitoring as your other containerized services. In addition, using concepts such as DaemonSets and Helm charts, you can quickly deploy a monitoring agent with the right configuration on every new node that you deploy.
The way we deal with this at CoScale is by running our monitoring agent inside a container. These containers can be deployed on various orchestrators and container platforms. We have integrations with Kubernetes, OpenShift, Docker Swarm, Google Container Engine, etc. We then run our various plugins for in-container monitoring directly in the namespace of the recognized containers to extract the relevant metrics.
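To show what deploying one agent per node with a DaemonSet looks like on Kubernetes, here is a minimal sketch that generates such a manifest. The agent name, image, and namespace are placeholders, not the real CoScale values. A real agent would typically also need host access settings and credentials, which are left out here.

```python
import json


def monitoring_daemonset(name, image, namespace="monitoring"):
    """Build a minimal Kubernetes DaemonSet manifest that schedules one
    copy of the monitoring agent container on every node.
    All names passed in are placeholders chosen for illustration."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


manifest = monitoring_daemonset(
    "monitoring-agent", "registry.example.com/monitoring-agent:1.0"
)
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests in JSON as well as YAML, so the printed output could be saved to a file and applied like any other manifest.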
Monitoring based on container images
Container-native monitoring also means that the container images define how the monitoring should be done, without needing an agent inside the containers and without requiring references to monitoring specifics, credentials, etc. Each image runs a different service or a different version of it, and these can all have different monitoring requirements. For example, an image running an NGINX web server will have different monitoring metrics than an image running Redis.
Container-native monitoring has a lot of facets to it, and the list above is not exhaustive but gives a good idea of how to leverage the core principles of container technology for monitoring. This includes the way information is accessed in a container environment, the way monitoring is set up, and the way in which low-level metrics are translated into actionable insights.
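One way to picture image-based monitoring is a lookup from the image name to a monitoring profile. The sketch below is an assumption-laden illustration, not how any particular product does it: the profile table, the metric names, and both helper functions are hypothetical.

```python
# Hypothetical monitoring profiles keyed by image name.
MONITORING_PROFILES = {
    "nginx": ["active_connections", "requests_per_second"],
    "redis": ["connected_clients", "used_memory"],
}


def image_name(image_ref):
    """Strip the registry/path prefix and the tag or digest from an image
    reference, e.g. 'registry.example.com:5000/library/nginx:1.13' -> 'nginx'."""
    last_part = image_ref.rsplit("/", 1)[-1]
    without_digest = last_part.split("@", 1)[0]
    return without_digest.split(":", 1)[0]


def metrics_for_image(image_ref):
    """Return the metrics to collect for a container started from
    `image_ref`, or an empty list if the image is not recognized."""
    return MONITORING_PROFILES.get(image_name(image_ref), [])


print(metrics_for_image("registry.example.com:5000/library/nginx:1.13"))
print(metrics_for_image("redis"))
```

A fuller version could also key the profiles on the image tag, since different versions of the same service can have different monitoring requirements.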