That's why Kublr built it into their platform from day one

True reliability requires self-healing nodes and infrastructure management

Even the best Kubernetes management solution cannot save you from bad infrastructure provisioning. You can’t achieve true self-healing applications if you don’t have a self-healing infrastructure. Find out what self-healing Kubernetes can and cannot do and how Kublr provides fully self-healing clusters, including master and worker nodes.

Here’s a dirty little secret. While infrastructure provisioning and self-healing are key to highly available, reliable clusters, they are still not standard in some of the most popular Kubernetes solutions. Seems like a pretty big deal, doesn’t it?

But wait, you might say, Kubernetes does provide self-healing. And you’d be right, but only partially. Kubernetes’ self-healing ability only applies to pods, and there are multiple layers to self-healing. To ensure application reliability, you’ll need (1) a self-healing infrastructure, which includes worker and master nodes, (2) self-healing clusters, achieved through self-healing masters (not necessarily workers), (3) self-healing pods, and (4) self-healing Kubernetes, which, while contingent on a self-healing infrastructure, requires specific configuration. So a self-healing infrastructure alone does not guarantee self-healing Kubernetes.

With so many layers at play, it’s no wonder people get confused, especially those new to Kubernetes.

Kubernetes ensures self-healing pods: if a pod goes down, Kubernetes will start a new one. However, if an entire node goes down, Kubernetes generally isn’t able to spin up a replacement. From a self-healing point of view, infrastructure can become the weakest link in the chain, jeopardizing the reliability of your applications. That’s because your clusters are only as reliable as the underlying infrastructure, which means even the best Kubernetes management solution cannot protect you from poor infrastructure provisioning.
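The difference between pod-level and node-level self-healing can be illustrated with a minimal Python sketch. It is a toy model, not Kubernetes code: the Node class, its capacity field, and the reconcile_pods function are hypothetical names invented for this example, and real scheduling is far more involved. The point it shows is that a reconcile step can move pods to healthy nodes but never creates a node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node model: a name, a pod capacity, and a health flag."""
    name: str
    capacity: int
    healthy: bool = True
    pods: int = 0


def reconcile_pods(nodes, desired_replicas):
    """Toy reconcile step: place missing replicas on healthy nodes only.

    Returns the number of replicas left pending because no healthy node
    has spare capacity. Note that it never adds a node to the list.
    """
    # Pods that were running on a failed node are gone.
    for node in nodes:
        if not node.healthy:
            node.pods = 0

    running = sum(node.pods for node in nodes)
    pending = max(0, desired_replicas - running)

    # Fill spare capacity on the nodes that are still healthy.
    for node in nodes:
        while pending > 0 and node.healthy and node.pods < node.capacity:
            node.pods += 1
            pending -= 1
    return pending
```

With two nodes of capacity two and three desired replicas, everything fits. Once one node is marked unhealthy, the surviving node can only host two replicas, so one stays pending until someone (or something) provisions a new node.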

What Kubernetes doesn’t do

The appeal of Kubernetes lies mostly in the promise of running applications in a reliable, highly available, and stable way across multiple infrastructures. Kubernetes is uniquely positioned to deliver on that promise. But although Kubernetes has a lot of capabilities, that doesn’t mean they are all available by default. Here are a few things you should know about self-healing and Kubernetes:

Kubernetes can’t provision infrastructure

The infrastructure must be provisioned before Kubernetes components can be installed, configured, and connected to create a cluster. This is basically what Kelsey Hightower described in his “Kubernetes the Hard Way” tutorial. Some of the complexity lies in automating infrastructure provisioning. Since Kubernetes doesn’t do it, it’s left either to you or to your provider. But beware: infrastructure provisioning is not yet a standard feature (though we strongly believe it should be).

Kubernetes’ self-healing ability is contingent on the underlying infrastructure

To self-heal, Kubernetes must either be able to provision the infrastructure or already have dedicated nodes it can access when needed. Either way, infrastructure must be made available to Kubernetes beforehand. No infrastructure means no self-healing. While a lack of infrastructure automation may be acceptable on-premises, where there aren’t many infrastructure automation options available (yet), it’s hard to justify in the cloud or in automated environments like VMware.

Self-healing of Kubernetes components depends on installation and configuration

If a component goes down, in most cases Kubernetes won’t be able to restore it. While it will alert the system admin that the node where the component ran is no longer available, it can’t simply restart the kubelet, the etcd cluster, the operating system, or any other system component Kubernetes relies on. Doing so requires specific configuration.

Whether or not a Kubernetes platform provisions the infrastructure clearly has significant ramifications. There are three different vendor approaches you should be aware of:

  1. Infrastructure provisioned by user (possibly manually): users are expected to provide a set of pre-provisioned/pre-configured nodes on which the solution will deploy the Kubernetes components and create the cluster.
  2. Masters provisioned automatically but workers manually: highly available managed master nodes (achieved through self-healing) only ensure highly available clusters; for applications to be highly available, self-healing worker nodes are also required (this is typically the approach of cloud services, which manage your masters but not your worker nodes).
  3. Infrastructure provisioned automatically: this approach automates the process whenever possible, leveraging native infrastructure automation tools such as AWS CloudFormation or VMware vSphere. This is the approach Kublr took.

Why does it matter? If the platform doesn’t provision the infrastructure, it cannot ensure the infrastructure was set up and initialized correctly. Who or what will guarantee that the right packages and versions of the operating system and software components are installed on the nodes? Those things really matter, as your Kubernetes cluster will only be as reliable as the infrastructure it runs on.

In short, you can’t achieve true self-healing applications if you don’t have a self-healing infrastructure. Without it, when a node goes down you are facing potential downtime of minutes, possibly even hours, depending on how quickly your Ops team reacts to the incident. That’s why infrastructure provisioning is built into the Kublr Platform, allowing for the setup of self-healing nodes whenever the infrastructure allows.

Kublr’s answer to self-healing nodes

The Kublr Platform leverages native infrastructure tools and capabilities (e.g. AWS CloudFormation or vSphere) to automate provisioning, scaling, and recovery. Autoscaling groups are automatically set up for every node, workers and masters alike. The platform sets up the infrastructure so that when a virtual machine goes down, another one automatically comes up. It watches for those events, ensuring that once the replacement machine is up, it is automatically connected back to the cluster, with no human intervention required. This is possible on any infrastructure that supports automatic recovery and instance restart.
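The replace-and-rejoin behavior described above can be sketched as a small reconcile loop. This is a toy Python model under stated assumptions, not Kublr’s implementation and not a real cloud API: the NodeGroup class, its fail and reconcile methods, and the instance naming scheme are all hypothetical stand-ins for what an autoscaling group plus a cluster-join step would do.

```python
import itertools


class NodeGroup:
    """Toy stand-in for an autoscaling group (hypothetical, not a cloud API)."""

    def __init__(self, name, desired):
        self.name = name
        self.desired = desired            # number of instances to keep running
        self._ids = itertools.count(1)    # source of unique instance numbers
        self.instances = {}               # instance id -> "running" | "failed"
        self.cluster_members = set()      # instances joined to the cluster
        self.reconcile()                  # bring the group up to `desired`

    def fail(self, instance_id):
        """Simulate a virtual machine going down."""
        self.instances[instance_id] = "failed"

    def reconcile(self):
        """Drop failed instances, launch replacements, and join them.

        Returns the ids of the instances launched in this pass.
        """
        for instance_id, state in list(self.instances.items()):
            if state == "failed":
                del self.instances[instance_id]
                self.cluster_members.discard(instance_id)

        launched = []
        while len(self.instances) < self.desired:
            new_id = f"{self.name}-{next(self._ids)}"
            self.instances[new_id] = "running"
            self.cluster_members.add(new_id)   # rejoin without human action
            launched.append(new_id)
        return launched
```

Creating NodeGroup("workers", 3), failing "workers-2", and calling reconcile() launches a single replacement and leaves three cluster members again. The same loop works for a group of masters, which is the point of treating both node types alike.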

Each machine instance, whether virtual or physical, is set up with a piece of initialization code, the Kublr agent, which ensures the node matches the desired state. The agent monitors the node to make sure that the node and its components are running properly. In fact, Kublr connects those nodes even before Kubernetes does, ensuring that worker and master nodes can communicate securely and set up secure connections between components.
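Matching a node to its desired state boils down to comparing what should be installed with what is installed and acting on the difference. The Python sketch below shows that comparison only. It is an assumption-laden illustration, not the Kublr agent: the diff_state function, the action names, and the component versions used in the usage note are hypothetical.

```python
def diff_state(desired, actual):
    """Return the actions needed to bring `actual` in line with `desired`.

    Both arguments map a component name to a version string. The result
    is a list of (action, component, version) tuples, sorted by component
    name so that the output is deterministic.
    """
    actions = []
    for component, version in sorted(desired.items()):
        current = actual.get(component)
        if current is None:
            # Component is missing from the node entirely.
            actions.append(("install", component, version))
        elif current != version:
            # Component is present but at the wrong version.
            actions.append(("replace", component, version))
    return actions
```

For example, with a desired state of {"kubelet": "1.13.2", "docker": "18.06"} and an actual state of {"kubelet": "1.12.0"} (made-up versions), the function reports that docker must be installed and kubelet replaced. An agent would run such a check repeatedly and apply the resulting actions.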

Native infrastructure provisioning tools enable Kublr to deploy self-healing and fully recoverable clusters. Recovery and self-healing are automatically set up each time a cluster is created. Additionally, the Kublr Control Plane provides an extra level of safety and reliability. As it monitors clusters and infrastructure, it checks whether the cluster parameters (which include CPU, RAM, disk, and network usage, as well as many other Kubernetes-specific parameters) are within the configured limits and alerts users in case of any deviation.
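The limit check described above can be sketched in a few lines of Python. This is a simplified illustration, not the Kublr Control Plane: the check_limits function, the metric names, and the treatment of a missing metric as an alert are assumptions made for the example.

```python
def check_limits(metrics, limits):
    """Compare observed metrics against configured upper limits.

    `metrics` and `limits` map a parameter name to a number. Returns a
    list of alert messages, one for each limit that is exceeded or has
    no matching metric, in the order the limits were configured.
    """
    alerts = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is None:
            # A missing reading is treated as a deviation worth flagging.
            alerts.append(f"{name}: no data")
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds limit {limit}")
    return alerts
```

Calling check_limits({"cpu": 0.95, "ram": 0.5}, {"cpu": 0.9, "ram": 0.8, "disk": 0.85}) yields one alert for CPU usage above its limit and one for the missing disk reading, while RAM passes silently.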

To summarize, not all self-healing is equal: there are multiple layers, and all of them are needed to ensure true application reliability. While numerous providers claim production-readiness, the lack of self-healing nodes is clear proof that at least some critical pieces aren’t yet automated, leaving room for manual, error-prone work. Kublr is among the first to provide fully self-healing clusters, including self-healing master and worker nodes and infrastructure management, ensuring reliable, highly available applications.

Slava Koltovich
Slava Koltovich is CEO of Kublr, an enterprise-grade Kubernetes platform. Passionate about increasing developer productivity, he has incubated and launched a number of successful products improving development processes. During his 20 years in the industry, Slava has delivered software to some of the world’s largest companies, including Facebook, Microsoft, and Google.

Oleg Chunikhin
With 20 years of software architecture and development experience, Kublr CTO Oleg Chunikhin is responsible for defining Kublr’s technology strategy and standards. He has championed the standardization of DevOps in all Kublr does and is committed to driving adoption of container technologies and automation.
