Monitoring microservices with health checks
What kinds of monitoring solutions are available to check the health of your microservices? In this article, Peter Arijs explains what you are looking for, goes over the different types of service checks, and gives an example of how you can do this with CoScale’s monitoring solutions.
There is a lot of talk about monitoring and observability in distributed systems these days. While I don’t want to get into the semantic discussion, it’s increasingly clear that people are looking for tools to offer them visibility and understanding of how their complex container and microservices environments work. And more importantly, why they suddenly stop working as expected. To achieve this, there are basically two ways of obtaining valuable data from your systems.
Black box vs. white box
A first way is to gather the data that your system or service is generating as it is running, representing its internal state. These are collected from log files, metrics and counters from interfaces, application traces, code instrumentation, etc. This is sometimes also referred to as white box monitoring. This is what we also do at CoScale with our application plugins that gather metrics and events from services running inside containers.
A second way is black box monitoring, where you are basically performing a number of tests or checks to understand how your service is behaving locally or from the outside, as a user is seeing it. That user could be an actual end user, or another service that is calling the service.
Black box monitoring gives you an idea of the state of a service, or even a complex interaction of different services.
White box (internal) and black box (external) monitoring are highly complementary, and that is why you typically find both in modern monitoring and observability tools. See also this blog post on monitoring distributed systems, explaining how Google effectively combines both approaches. In the remainder of this post we delve deeper into the topic of setting up health checks for the purpose of black box monitoring.
Service health checks
What are health checks?
Health checks are basically endpoints provided by a service to check whether the service is running properly. It is advisable to create multiple specific health checks per service, which has several benefits:
- Specific health checks allow the user to pinpoint the malfunction
- Latency measurements on specific health checks can be used to predict outages
- Health checks can have different severities:
  - Critical: the service does not respond to any requests
  - High priority: a downstream service is not available
  - Low priority: a downstream service is not available, but a cached version can be served
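As a rough illustration of the severity levels above, the sketch below models a health check as a named probe with a severity and reports the worst severity among the failing checks. The names `Severity`, `HealthCheck` and `worst_failure` are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Severity(Enum):
    # Higher value means a more severe failure.
    LOW = 1       # downstream service down, cached version can be served
    HIGH = 2      # downstream service down
    CRITICAL = 3  # service does not respond to any requests


@dataclass
class HealthCheck:
    name: str
    severity: Severity
    probe: Callable[[], bool]  # returns True when the check passes


def worst_failure(checks: List[HealthCheck]) -> Optional[Severity]:
    """Return the highest severity among failing checks, or None if all pass."""
    failed = [c.severity for c in checks if not c.probe()]
    return max(failed, key=lambda s: s.value) if failed else None
```

An orchestration tool would typically only act on a `CRITICAL` result, while a monitoring tool could alert on all three levels.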
Why should I have health checks?
All services should implement health checks. These checks can be used by orchestration tools to kill a process in case of a failing “Critical” health check. Health checks can also be used by monitoring tools to track and alert on the availability and performance of the service, where they serve as early problem indicators.
What to check in health checks?
The most important cases to check are:
- Is the service responding to requests as expected?
- Are the downstream services available?
- Perform an end-to-end transaction to verify whether backend services are working
We will elaborate on these topics in more detail in the types of health checks section.
How to document the health checks?
Health checks can perform complex operations like end-to-end transactions to see whether the complete system is working. In order to make this information useful to everyone on the team, it is important to describe what the health check does and which services are involved in the health check.
When a health check goes red, it should be possible to easily determine which services might be causing the issue at hand.
Types of health checks
Local checks are health checks that only use resources that are available locally on the service; they don’t perform any external calls.
The simplest local check is a request that returns a fixed message. This might seem like a silly health check, but if it fails:
- We know there is a fundamental problem with this service
- This might be caused by garbage collection, thread starvation, or other issues
- This is certainly a critical failure and appropriate action should be taken
Another example of a local health check is checking whether the files required for the service are properly mounted, readable and consistent.
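The two local checks described above could look roughly like the following sketch, which uses only Python's standard library. The URL paths `/health/ping` and `/health/files`, the port, and the file path in `REQUIRED_FILES` are assumptions made for illustration; a consistency check on the file contents is left out.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical list of files the service needs in order to run.
REQUIRED_FILES = ["/etc/myservice/config.yml"]


def files_readable(paths):
    """Return True if every path is an existing file we can read."""
    return all(os.path.isfile(p) and os.access(p, os.R_OK) for p in paths)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/ping":
            # Simplest local check: always return a fixed message.
            self._reply(200, "pong")
        elif self.path == "/health/files":
            ok = files_readable(REQUIRED_FILES)
            self._reply(200 if ok else 503, "ok" if ok else "missing files")
        else:
            self._reply(404, "unknown check")

    def _reply(self, status, body):
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Serve the health endpoints until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` exposes both checks; in a real service these handlers would normally be added to the existing web framework rather than run as a separate server.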
Very few services are able to get anything done without communicating with other services (we call these downstream services). The downstream services might be services that are being run by the same team or a different team, in the same cluster or remote, in the same company or a public service.
A few issues that can arise when talking to external services:
- Errors or high latencies on the downstream service
- Rate limits on downstream services
- Unexpected API changes that break the communication
Multiple health checks can help determine what the issue is. For example, you can have separate checks for:
- Whether the target host can be resolved
- Whether the port on the target host is open
- Whether a connection can be created to the target host
- Whether a full request to the service can be completed
- Whether a full request to the service can be completed
This makes it easy to pinpoint whether the issue is in the DNS, firewall, external network or service, and who should be involved to mitigate it. This is the ideal situation, but in practice most teams start off with a single health check that performs a request to the external service.
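A minimal sketch of such layered dependency checks is shown below, using Python's standard library. For brevity it folds the "port open" and "connection created" checks into a single TCP connect. The function names and the result labels returned by `diagnose` are assumptions for this example, not part of any particular tool.

```python
import http.client
import socket
import urllib.error
import urllib.request


def check_dns(host):
    """Can the target host name be resolved?"""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def check_tcp(host, port, timeout=3.0):
    """Is the port open, and can we establish a TCP connection to it?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_http(url, timeout=3.0):
    """Can a full HTTP request be completed with a 200 response?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def diagnose(host, port, url):
    """Run the checks from the outside in and name the first failing layer."""
    if not check_dns(host):
        return "dns"
    if not check_tcp(host, port):
        return "network"   # firewall, routing, or nothing listening
    if not check_http(url):
        return "service"
    return "ok"
```

Running the checks in this order means the first failing layer points at the team that should look into the problem first.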
End-to-end checks are a bit like checking external dependencies but go one step further. Instead of making one call to check whether a service is available, a real business workflow is executed. This workflow can be made up of multiple calls, for example inserting data and afterwards checking whether the data is available.
This scenario is often used when there are backend services involved that do external processing. For example, imagine a service that receives data and sends it asynchronously to a worker, which in turn pushes the data to another service. Checking whether this workflow works end to end can be done by executing a data push in one health check and retrieving the data in a later one. The whole process might take a few minutes, but since every health check pushes data, the data should become available within a certain SLA.
This type of health check verifies whether the system as a whole is working. It can detect tricky malfunctions that would otherwise only be visible in real customer transactions.
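The push-now, verify-later pattern can be sketched as follows. Every run pushes a new marker and verifies that markers pushed by earlier runs have arrived within the SLA. The class name `EndToEndCheck` and the `push` and `exists` callables are hypothetical placeholders for whatever your pipeline offers to insert and look up data.

```python
import time
import uuid


class EndToEndCheck:
    def __init__(self, push, exists, sla_seconds, clock=time.monotonic):
        self.push = push        # push(marker): send test data into the pipeline
        self.exists = exists    # exists(marker): True once the data has arrived
        self.sla = sla_seconds  # maximum time the data may take to arrive
        self.clock = clock
        self.pending = {}       # marker -> time at which it was pushed

    def run(self):
        """Return True if no earlier marker has missed the SLA."""
        now = self.clock()
        healthy = True
        for marker, pushed_at in list(self.pending.items()):
            if self.exists(marker):
                del self.pending[marker]      # arrived, stop tracking it
            elif now - pushed_at > self.sla:
                healthy = False               # overdue: report a failure
                del self.pending[marker]
        # Push fresh data so that a later run has something to verify.
        marker = str(uuid.uuid4())
        self.push(marker)
        self.pending[marker] = now
        return healthy
```

Markers that are still within the SLA stay pending and do not fail the check, so a pipeline that legitimately takes a few minutes is not reported as broken.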
As an example, we consider a simple API-based application using Nginx to insert data in Elasticsearch. We could implement the following health checks for this application:
- API login health
  - Action: performs a login on the API
  - Depends on: Nginx, API
- API event health
  - Action: inserts a dummy event in Elasticsearch and reads it back
  - Depends on: Nginx, API, Elasticsearch
For example, if only the second check (API event health) goes down, you can easily see that there is an issue with Elasticsearch, since the other downstream services are also covered by the first health check.
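This reasoning can be automated once the dependencies of each check are documented. The sketch below is a simple heuristic, not a CoScale feature: it treats every dependency of a passing check as cleared and reports what remains of the dependencies shared by the failing checks. The check identifiers and the `suspects` function are names invented for this example.

```python
# Documented dependencies per health check, taken from the example above.
CHECKS = {
    "api_login_health": {"Nginx", "API"},
    "api_event_health": {"Nginx", "API", "Elasticsearch"},
}


def suspects(results):
    """Given {check name: passed?}, return the services most likely at fault."""
    failing = [CHECKS[name] for name, ok in results.items() if not ok]
    if not failing:
        return set()
    # Services exercised by a passing check are assumed to be working.
    cleared = set().union(*(CHECKS[name] for name, ok in results.items() if ok))
    # A single root cause must be a dependency of every failing check.
    return set.intersection(*failing) - cleared
```

With `{"api_login_health": True, "api_event_health": False}` this returns only Elasticsearch; if both checks fail, it returns Nginx and the API as the shared suspects.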
Integrating health checks in CoScale
CoScale offers a solution for monitoring containers and microservices. We use an agent running in its own container that works with a set of plugins to recognize and monitor images running known application components.
Next to that, CoScale offers several mechanisms to implement health checks for the services running inside your containers. For all of the health checks we measure whether the check is successful and the latency for performing the health check. The latency of a health check can give an early indication that a system is degrading and might precede an actual outage.
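Measuring both the result and the latency of a check needs very little code. The sketch below is a generic illustration and does not show how the CoScale agent does it internally; `timed_check`, `is_slow` and the threshold value are assumptions for this example.

```python
import time


def timed_check(probe):
    """Run a probe and return (success, latency in milliseconds)."""
    start = time.perf_counter()
    try:
        ok = bool(probe())
    except Exception:
        ok = False  # a probe that raises counts as a failed check
    latency_ms = (time.perf_counter() - start) * 1000.0
    return ok, latency_ms


def is_slow(latency_ms, threshold_ms=500.0):
    """Flag a check that succeeded but took longer than the chosen threshold."""
    return latency_ms > threshold_ms
```

A check that still succeeds but is repeatedly flagged as slow is the kind of early warning that can precede an outage.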
These health checks are integrated as so-called active checks in our standard plugins for HTTP, database, queuing and other services. Below you can see an example of these active checks for our Nginx and RabbitMQ plugins.
For generic HTTP endpoints, there is also an Endpoint checker plugin. The plugin configuration consists of the HTTP URL that will be checked. The URL is fetched every minute: if the status code is 200 the uptime is set to 100%, otherwise it is set to 0%. When looking at data for longer periods (e.g. a month), you can see the uptime for the service as a percentage. As mentioned before, the latency is gathered alongside the uptime.
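To make the uptime arithmetic concrete, here is a small sketch. It assumes that the uptime over a longer period is the plain average of the per-minute samples, which is the natural reading of the description above but is not confirmed CoScale behavior; the function names are hypothetical.

```python
def uptime_sample(status_code):
    """One sample per fetch: 100% for HTTP 200, 0% for anything else."""
    return 100.0 if status_code == 200 else 0.0


def uptime_percent(status_codes):
    """Average the samples over a period; None if there are no samples."""
    samples = [uptime_sample(code) for code in status_codes]
    return sum(samples) / len(samples) if samples else None
```

For example, a 30-day month has 43,200 one-minute samples, so 43 failed fetches in that month correspond to an uptime of roughly 99.90%.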
For other endpoints, the Generic script plugin is available. This plugin allows the user to execute arbitrary scripts (in any language) to create metrics, for example:
- Ping check, port check, https certificate check, …
- Postfix, NFS checks
- Docker end-to-end tests
Visualizing and alerting on health checks in CoScale
Once you have added these health checks, you can also use them in your CoScale dashboards. One useful widget is the Uptime widget, which shows a graphical representation of when the service was available. Multiple checks and services can be visualized on a single widget. Next to that, events can be used to mark periods of maintenance or expected downtime.
The latency metrics of these health checks can also be used in other widget types such as tiles or graph charts, as shown below.
And finally, you can of course also set alerts on these health checks to get notified when a service goes down or does not respond as expected.
Health checks allow you to monitor and test your services in production to get early notifications and pinpoint problems with your services. Health checks typically help you to triage a problem with a service, while for the detailed troubleshooting it is useful to combine these health checks with more detailed internal instrumentation of your services, like we do at CoScale.