More containers means we need better system visibility
Containers are the foundation for much of modern computing, but they also obscure where problems initially develop. In this article, Mark Herring explains why capturing important metrics becomes difficult in a container-based infrastructure, why managing metrics storage needs to be done properly, and how new data platforms fulfill the needs of the modern web.
In ancient times when client/server or mainframe computing was the norm, it was easy to collect and analyze metrics from the systems. It was simple for developers to determine whether the server needed more memory or whether storage capacity was running low. Only a few measurements were needed to determine the overall performance of the system. Those times seem so quaint today!
As we have gone from a hardware-centric style of computing to a virtualized world, there is an increased need for better visibility into what is really happening inside of containers, pods, and node clusters. Today’s computing is a complex environment that requires continuous orchestration and management. As the old saying goes, “You can’t manage what you don’t measure.”
Docker containers have become wildly popular for various reasons. One reason is that containerized applications are easy to scale; containers can be added or subtracted quickly from an environment. Developers typically use Kubernetes to organize their containers, putting them into pods, then putting the pods onto nodes, and moving things around to balance the loads. Developers also spawn new pods when the application is in demand and kill them when demand winds down. There are so many moving parts – perhaps thousands or tens of thousands of pods and containers – and everything is ephemeral, lasting just a short time before disappearing for good. It’s a very dynamic environment, which creates challenges for maintaining visibility.
Capturing and analyzing the metrics that matter becomes an essential step to managing the environment. It’s impossible to troubleshoot problems without metrics. For example, metrics tell us why an application is failing. Is it a node failure? Is it a problem with a Docker container? Is it a problem with a pod, or with the orchestration process? Correlating some of the metrics behind this scenario can point to the origin of the problem. This is important because when pods are deploying and working well, the application is resilient to failure and can scale when demand increases.
Measure what matters
Docker and Kubernetes both provide metrics APIs. For example, a partial list of pod metrics can be found on GitHub. There’s certainly no shortage of metrics that can be captured. What’s important, however, is for a developer to decide which pieces of information matter to the business. Different metrics might be needed based on the type of pod being measured. If it’s a pod that’s doing an HTTP listener, the developer is probably interested in the pod’s uptime measurement, what its throughput has been like, and maybe how long it has been alive. If it’s a pod handling a credit check service, the metrics that are important might be response time and the queue of people waiting on the service. What metrics to capture and keep depends on the application and its significance to the business.
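To make the idea concrete, here is a minimal sketch of choosing metrics by pod type. The role names, metric names, and the `select_metrics` helper are hypothetical, invented for illustration; they are not part of the Docker or Kubernetes metrics APIs.

```python
# Hypothetical mapping from a pod's role to the metrics worth keeping.
# The roles and metric names are illustrative assumptions only.
METRICS_BY_ROLE = {
    "http-listener": ["uptime_seconds", "requests_per_second"],
    "credit-check": ["response_time_ms", "queue_depth"],
}


def select_metrics(role, available):
    """Keep only the metrics in `available` that matter for this pod role."""
    wanted = METRICS_BY_ROLE.get(role, [])
    return {name: available[name] for name in wanted if name in available}


# Example: a credit check pod reports three values, but only two are kept.
sample = {"response_time_ms": 42, "queue_depth": 3, "cpu_fraction": 0.5}
kept = select_metrics("credit-check", sample)
```

Keeping the selection in one table means the decision about what matters to the business is recorded in a single place rather than scattered across services.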
Not only are these measurements important to the orchestrator that’s managing resources, but they also can be tied to a business SLA. Suppose a company is running an e-commerce website. The business SLA says that when customers click to refresh the screen, the refresh shouldn’t take more than a tenth of a second. A lot has to happen between the moment the refresh is clicked and the moment the refreshed screen appears. The developer would want to capture the start time and end time as metrics, and then look at the dozen things that happen in between those points in time. Each of those activities would have associated metrics as well. It’s up to the application provider to decide what information will tell the story of the total refresh time and then capture and analyze those metrics. Metrics such as these are really just time series data, that is, data that carries a timestamp. The analysis is all about visibility of change over some time boundary. Time series data has unique properties that make it very different from other data workloads, including high ingestion rates, real-time queries, and time-based analytics.
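The refresh example can be sketched in a few lines. This is only an illustration under stated assumptions: the `Point` record, the `timed_steps` and `meets_sla` helpers, and the step names are hypothetical, and a real system would send the points to a metric store rather than keep them in a list.

```python
import time
from dataclasses import dataclass

SLA_SECONDS = 0.1  # "a tenth of a second" from the SLA example


@dataclass
class Point:
    """One time series point: a named measurement with a timestamp."""
    name: str
    timestamp: float  # wall-clock time (seconds since epoch) when the step ended
    duration: float   # seconds spent in the step


def timed_steps(steps):
    """Run each (name, callable) step in order and record one Point per step."""
    points = []
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        points.append(Point(name, time.time(), elapsed))
    return points


def total_duration(points):
    """Total time across all recorded steps."""
    return sum(p.duration for p in points)


def meets_sla(points, limit=SLA_SECONDS):
    """True if the steps together finished within the SLA limit."""
    return total_duration(points) <= limit


# Example with two placeholder steps standing in for real work.
points = timed_steps([("query_cart", lambda: None), ("render_page", lambda: None)])
```

Recording a point per step, rather than only the start and end times, is what lets an analyst see which of the intermediate activities consumed the refresh budget.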
The importance of the metrics store
Capturing the information for these time series measurements creates an increased instrumentation workload across the numerous microservices, containers, pods, and nodes that may be involved. This raises the question: where will these metrics be stored, and for how long? There are tools that create a metric store in a container on the node cluster. This is convenient, but what happens if the container or node goes down? The store may rebuild itself, but the metrics it held are lost forever because they were stored in a temporary place. While this might seem trivial, it is highly significant for a regulated business like a financial services firm, whose performance metrics are important from a regulatory control perspective.
For applications with a high availability requirement, the need for availability extends to the metric store. This is because the data in that store is orchestrating much of what the application does. If the store fails, the orchestration fails and so does the application. Containers can be ephemeral but the metric store cannot be. The metric store should be a purpose-built time series database that facilitates taking action, such as orchestrating the environment and providing thorough visibility of the entire environment.
The architecture of the metrics store
Computing infrastructure and architectures evolve based on new demands and needs. Existing technologies are often just not good enough to meet these new requirements. Consider the potentially thousands of pods and containers generating hundreds of thousands of metrics per second that all need to be stored, compressed, and acted upon in real time. Traditional SQL and NoSQL data stores are inadequate to meet these demands. Modern data platforms are built to support time series data and these kinds of metrics. In particular, they are designed to address the following requirements:
- Designed for real-time – The modern world is mercilessly real-time. Users need and want real-time, immediate access to all of their data. They need to identify patterns, predict the future, control systems, and get the insights necessary to stay ahead of the curve. Data should be available and queryable as soon as it is written. Business decisions demand immediate results. A system that is not real-time ready is not poised to succeed in the modern world.
- Designed for automation – Basic monitoring is too passive. You can’t manage what you don’t monitor, but advances in machine learning and analytics make automation and self-regulating actions a reality. A modern system must be able to trigger actions, perform automated control functions, self-regulate, and provide the foundation for performing actions based on predictive trends.
- Cloud scale – The world demands systems that are available 24/7/365 and can automatically scale up and down depending on demand. They must be able to be deployed across different infrastructures without undue complexity. They need to make optimal use of resources; for instance, keeping only what is needed in memory, compressing data on disk when necessary, and moving less relevant data to cold storage for later analysis.
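Two of the requirements above, automation and efficient use of resources, can be sketched briefly. The example assumes points are plain `(timestamp, value)` pairs; the function names, the window sizes, and the threshold are hypothetical, and a real time series platform would provide its own retention and alerting features.

```python
from collections import defaultdict
from statistics import mean


def split_hot_cold(points, now, hot_window):
    """Split (timestamp, value) points into recent "hot" and older "cold" data."""
    hot = [(t, v) for t, v in points if now - t <= hot_window]
    cold = [(t, v) for t, v in points if now - t > hot_window]
    return hot, cold


def downsample(points, bucket_seconds):
    """Replace raw points with one average per time bucket to save space."""
    buckets = defaultdict(list)
    for t, v in points:
        bucket_start = int(t // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(v)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def should_scale_up(points, threshold):
    """Automation rule: act when the average of recent values exceeds a threshold."""
    return bool(points) and mean(v for _, v in points) > threshold


# Example: keep the last 50 seconds raw, roll older data into 60-second averages.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (100, 9.0)]
hot, cold = split_hot_cold(raw, now=100, hot_window=50)
archived = downsample(cold, bucket_seconds=60)
scale = should_scale_up(hot, threshold=6.0)
```

The recent data stays at full resolution for real-time decisions such as scaling, while older data is reduced to averages that are cheaper to keep for later analysis.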
Containerization and the new mode of agile development are a boon for business applications. But with so many moving parts in the dynamic environment of containers, tight orchestration and management are critical for the smooth operation and excellent performance of an application. This creates a greater need for visibility, requiring developers to collect and analyze performance metrics from all of the application components. This visibility can be achieved only by implementing new time series data platforms designed for the volume and real-time actions required by these metrics.
There are several steps a developer can take to maximize the benefits of metrics in their applications.
- Decide what metrics matter to the business and instrument the necessary elements to capture those metrics.
- Select a metric store that can properly care for the data that is needed, for as long as it is needed.
- Utilize the metrics to take action to manage the application’s performance and availability.
There’s a corollary statement to the old saying “You can’t manage what you don’t measure.” That corollary is, “What gets measured gets managed, and what gets managed gets done.”
This article is part of the latest JAX Magazine issue. You can download it now for free.