Nobody puts Java in a container
What are the pitfalls of running Java or JVM-based applications in containers? In this article, Jörg Schad goes over the challenges and how to solve them.
This post is adapted from a session presented at Codemotion 2017.
Tl;dr: The Java Virtual Machine (even with the Java 9 release) is not fully aware of the isolation mechanisms containers use internally. This can lead to behavior that unexpectedly differs between environments (e.g., test vs. production). To avoid this, consider overriding some default parameters (which are usually derived from the memory and processors available on the node) to match the container limits.
First off, I have to admit that I mostly develop with C++ and Golang. So why do I care about running JVM-based applications in containers? The reason is that I work on Apache Mesos and DC/OS, which are both platforms enabling users to run their containerized applications (or a large number of data services such as Apache Spark, Flink, Kafka, and many more, which also utilize containers).
So, when they run into problems, they ask people like me for help. That’s why I care about this topic. Also, I have two great colleagues, Ken Sipe and Johannes Unterstein, who are always willing to help me with their endless Java knowledge.
This article is structured as follows:
We will first try to understand what containers actually are and how they are built from Linux kernel features. We will then recall some details of how the JVM deals with memory and CPU. Finally, we will bring both parts together and see how the JVM runs in containerized environments and which challenges arise. Of course, we will also discuss how to deal with these challenges. The first two sections are mostly intended to provide the prerequisites for the last part, so if you already have deep knowledge of container or JVM internals, feel free to skip or skim the respective section.
While many people know about containers, how many of us know much about the underlying concepts, like cgroups and namespaces? I’d like to give a bit of an introduction to these topics. These are the building blocks for understanding the challenges of running Java in containers.
Containers are a really cool tool to package applications and basically write once, run anywhere – to a certain degree. That’s the promise of containers, anyways. To what degree is this promise holding true? Many of us working with Java have heard this promise before: Java claims that you can write an application and run it anywhere. Can these two promises be combined in Docker containers? We’ll see.
On a high level, containers appear like a lightweight virtual machine.
- I can get a shell on it (through SSH or otherwise)
- It “feels” like a VM
- own process space
- own network interface
- can install packages
- can run servers
- can be packaged into images
On the other hand, they’re not at all like virtual machines.
They are basically just isolated Linux process groups; one could even argue that containers are just an artificial concept. This means that all “containers” running on the same host are process groups running on the same Linux kernel of that host.
Let us look at what that means in more detail and start two containers using Docker:
$ docker run ubuntu sleep 1000 &
47048
$ docker run ubuntu sleep 1000000 &
47051
Let us quickly check both are running:
$ docker ps
CONTAINER ID   IMAGE    COMMAND           CREATED          STATUS          PORTS   NAMES
f5b9bb7523b1   ubuntu   "sleep 1000000"   10 seconds ago   Up 8 seconds            lucid_heisenberg
b8869675eb5d   ubuntu   "sleep 1000"      14 seconds ago   Up 13 seconds           agitated_brattain
Great, both containers are running. So what happens if we look at the processes on the host? Just a warning: if you try this with Docker for Mac, you will not be able to see the `sleep` processes, as Docker on Mac runs the actual containers inside a virtual machine (and hence the host the containers are running on is not your Mac, but the virtual machine).
$ ps faux …
So here we can see both `sleep` processes running as child processes of the containerd process. As a result of “just” being process groups on a Linux kernel:
- Containers have weaker isolation compared to virtual machines
- Containers can run with near-native CPU/IO speed
- Containers launch in around 0.1 seconds (libcontainer)
- Containers have less storage and memory overhead
As we have seen earlier, at their core containers are standard Linux processes running on a shared kernel.
But what is the view from inside one of those containers? Let us investigate that by starting an interactive shell inside one of the containers using docker exec and then looking at the visible processes:
$ docker exec -it f5b9bb7523b1 bin/bash
root@5e1cb2fd8fcb:/# ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7 0.3 0.0 18212 3288 ? Ss 21:38 0:00 bin/bash
root 17 0.0 0.0 34428 2944 ? R+ 21:38 0:00 \_ ps faux
root 1 0.0 0.0 4384 664 ? Ss 21:23 0:00 sleep 1000000
From inside the container, we can only see one sleep task as it is isolated from the other containers.
So, how do containers manage to provide such isolated views?
There are two Linux kernel features coming to the rescue: cgroups and namespaces. These are utilized by pretty much all container technologies, such as Docker, Mesos Containerizer, or rkt. Actually, the interesting bit is that Mesos has had its own containerizer from the early days. It depends on the same building blocks, so we also use cgroups and namespaces internally. As a result, Mesos can even utilize Docker images without having to rely on the Docker daemon.
Namespaces are basically used for providing different isolated views of the system. Each container can have its own view of different namespaces, such as process IDs, network interfaces, or user IDs. For example, processes in different containers can each have the process ID 1.
While namespaces provide isolated views, control groups (cgroups for short) are used to isolate access to resources. cgroups can be used either for limiting access (e.g., a process group can only use a maximum of 2GB of memory) or for accounting (e.g., keeping track of how many CPU cycles a certain process group consumed over the last minute). We’re going to see that in some more detail later.
As mentioned before, every container has its own view of the system and namespaces are used to provide these views for the following resources (amongst others):
- pid (processes)
- net (network interfaces, routing…)
- ipc (System V IPC)
- mnt (mount points, filesystems)
- uts (hostname)
- user (UIDs)
Consider, for example, the process ID namespace and our previous example of running two Docker containers. From the host operating system we could see 13347 and 13422 as the process IDs of the two sleep processes.
But from inside the container it looked slightly different:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7 0.3 0.0 18212 3288 ? Ss 21:38 0:00 bin/bash
root 17 0.0 0.0 34428 2944 ? R+ 21:38 0:00 \_ ps faux
root 1 0.0 0.0 4384 664 ? Ss 21:23 0:00 sleep 1000000
So from inside the container the sleep 1000000 process has the process ID 1 (in contrast to 13422 on the host). This is because the container runs in its own namespace and hence has its own view of process IDs.
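To see the PID namespace from the JVM's perspective, a small sketch like the following can help. It assumes Java 9 or later (for `ProcessHandle`); the class name `PidView` is made up for illustration. When started as the main process of a container it would typically report a very small PID, while the same process has a much larger PID on the host.

```java
final class PidView {

    /** Returns this JVM's process ID as seen from inside its own PID namespace. */
    static long currentPid() {
        return ProcessHandle.current().pid(); // requires Java 9+
    }

    public static void main(String[] args) {
        System.out.println("My PID as seen from this namespace: " + currentPid());
    }
}
```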
We’re diving a little deeper into control groups here. (To be precise, we are talking about cgroups v1 here; there is also v2, which has substantial differences.) As mentioned previously, control groups can be used both for limiting access and for accounting purposes. As with everything in Linux or Unix, cgroups are exposed as a hierarchical folder structure which can be viewed as a tree. Such a structure works as follows:
- Each subsystem (memory, CPU…) has a hierarchy (tree)
- Each process belongs to exactly 1 node in each hierarchy
- Each hierarchy starts with 1 node (the root)
- Each node = group of processes (sharing the same resources)
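To make the "one node per hierarchy" rule concrete, here is a minimal sketch that lists the node the current process belongs to for each cgroups v1 subsystem. It assumes a Linux host that exposes `/proc/self/cgroup` with lines of the form `hierarchy-ID:controller-list:cgroup-path`; the class name `CgroupMembership` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CgroupMembership {

    /** Maps each cgroups v1 controller (e.g. "memory") to the node this process belongs to. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> nodeByController = new LinkedHashMap<>();
        for (String line : lines) {
            // Expected format: hierarchy-ID:controller-list:cgroup-path
            String[] parts = line.split(":", 3);
            if (parts.length < 3 || parts[1].isEmpty()) {
                continue; // skip malformed lines and the cgroups v2 entry ("0::/path")
            }
            // Several controllers can share one hierarchy, e.g. "cpu,cpuacct"
            for (String controller : parts[1].split(",")) {
                nodeByController.put(controller, parts[2]);
            }
        }
        return nodeByController;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/proc/self/cgroup"); // Linux only
        if (Files.isReadable(file)) {
            parse(Files.readAllLines(file)).forEach(
                (controller, node) -> System.out.println(controller + " -> " + node));
        } else {
            System.out.println("No /proc/self/cgroup here (not a Linux host?)");
        }
    }
}
```

Run inside a Docker container on a cgroups v1 host, the printed paths would usually point at a per-container node; on the host itself they tend to be `/` or a session-specific node.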
The interesting part is that one can set limits for each node in this tree. Consider for example the memory subsystem:
- Each group can have hard and soft limits
- Soft limits are not enforced (i.e., only trigger a warning in the log), which may be useful, for example, for monitoring or for trying to determine the best memory limit for a certain application.
- Hard limits will trigger a per-group OOM killer. This often requires a changed mindset for Java developers, as they are used to an `OutOfMemoryError` which they can react to accordingly. But in the case of a container with a hard memory limit, the entire container will simply be killed without warning.
When using docker we can set a hard limit of 128MB for our containers as follows:
docker run -it --rm -m 128m fedora bash
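The limit set this way can be inspected from inside the container. The following sketch assumes cgroups v1 with the memory controller mounted at `/sys/fs/cgroup/memory` (the usual location in Docker containers, but still an assumption); the class name and the "unlimited" threshold are choices made for this example only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalLong;

final class CgroupMemoryLimit {

    // cgroups v1 reports "no limit" as a huge value close to Long.MAX_VALUE.
    // Treating everything at or above 1 EiB as unlimited is an assumption of this sketch.
    private static final long UNLIMITED_THRESHOLD = 1L << 60;

    /** Parses the content of memory.limit_in_bytes; empty means "no hard limit". */
    static OptionalLong parseLimit(String fileContent) {
        long bytes = Long.parseLong(fileContent.trim());
        return bytes >= UNLIMITED_THRESHOLD ? OptionalLong.empty() : OptionalLong.of(bytes);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        if (!Files.isReadable(file)) {
            System.out.println("No cgroups v1 memory controller found at " + file);
            return;
        }
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        OptionalLong limit = parseLimit(content);
        System.out.println(limit.isPresent()
            ? "Hard memory limit: " + limit.getAsLong() / (1024 * 1024) + " MiB"
            : "No hard memory limit set");
    }
}
```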
After having looked at the memory isolation, let us consider CPU isolation next. Here, we have two main choices: CPU shares and CPU sets. There’s a difference between them, which will be relevant.
CPU shares are the default CPU isolation and basically provide a priority weighting across all CPU cycles on all cores.
The default weight of any process is 1024, so if we start a container as follows
docker run -it --rm -c 512 stress
it will receive fewer CPU cycles than a default process/container.
But how many cycles exactly? That depends on the overall set of processes running at that node. Let us consider two cgroups A and B.
sudo cgcreate -g cpu:A
sudo cgcreate -g cpu:B
cgroup A: sudo cgset -r cpu.shares=768 A   (75%)
cgroup B: sudo cgset -r cpu.shares=256 B   (25%)
Cgroup A has 768 CPU shares and cgroup B has 256. That means that if nothing else is running on the system and both groups are busy, cgroup A is going to receive 75% of the CPU cycles and cgroup B will receive the remaining 25%.
If we remove cgroup A, then cgroup B would end up receiving 100% of the CPU cycles.
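The arithmetic behind these percentages is simply each group's weight divided by the sum of the weights of all busy groups. A minimal sketch (the class and method names are made up for illustration, and it ignores idle groups and nested hierarchies):

```java
final class CpuShares {

    /** Fraction of CPU time a group receives while all listed groups are busy. */
    static double fraction(int ownShares, int... sharesOfAllBusyGroups) {
        long total = 0;
        for (int shares : sharesOfAllBusyGroups) {
            total += shares;
        }
        return (double) ownShares / total;
    }

    public static void main(String[] args) {
        System.out.println("A with B running: " + fraction(768, 768, 256));
        System.out.println("B with A running: " + fraction(256, 768, 256));
        System.out.println("B alone:          " + fraction(256, 256));
    }
}
```

This also illustrates why shares are relative: the same weight of 256 can mean a quarter of the machine or all of it, depending on what else is running.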
Note that you can also use CFS isolation for stricter, less optimistic isolation guarantees, but we refer to this blog post for more details.
CPU sets are slightly different. They allow you to limit a container to specific CPU(s). This is mostly used to avoid processes bouncing between CPUs, but it is also relevant for NUMA systems, where different CPUs have fast access to different memory regions (and hence you want your container to only utilize the CPUs with fast access to the same memory region).
We can use cpu sets with docker as follows:
docker run -it --cpuset-cpus=0,4,6 stress
This means, we would pin the containers to the CPUs 0, 4, and 6.
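A cpuset is written as a comma-separated list in which ranges such as `0-3` are also allowed. The following sketch counts how many CPUs such a list grants; the class name `CpuSetList` is hypothetical, and the parser assumes well-formed input.

```java
final class CpuSetList {

    /** Counts the CPUs in a cpuset list such as "0,4,6" or "0-3,8". */
    static int countCpus(String cpuList) {
        int count = 0;
        for (String part : cpuList.trim().split(",")) {
            if (part.isEmpty()) {
                continue; // an empty list grants no CPUs
            }
            String[] bounds = part.split("-");
            int first = Integer.parseInt(bounds[0].trim());
            // A single CPU like "4" is a range of length one
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : first;
            count += last - first + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("CPUs in cpuset 0,4,6: " + countCpus("0,4,6"));
    }
}
```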
Let’s talk Java
Next, let us recall some details of Java.
First of all, Java consists of several parts: the Java language, the Java specifications, and the Java runtime. Here, we’re mostly going to talk about the Java runtime, since that is what actually runs inside our containers.
JVM’s memory footprint
So, what contributes to the JVM memory footprint? Most of us who have run a Java application know how to set the maximum heap space. But there’s actually a lot more contributing to the memory footprint:
- Native JRE
- Perm / metaspace
- JIT bytecode
This is a lot that needs to be kept in mind when we want to set memory limits for Docker containers. In particular, setting the container memory limit to the maximum heap space might not be sufficient…
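A running JVM can report part of this breakdown itself. The sketch below uses the standard `java.lang.management` API to compare heap and non-heap usage; the class name is made up, and note that the "non-heap" figure still does not cover everything outside the heap (thread stacks, for example, are not included).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

final class MemoryFootprint {

    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Bytes currently used on the Java heap. */
    static long heapUsedBytes() {
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Bytes used by JVM-managed non-heap pools (e.g. metaspace, compiled code). */
    static long nonHeapUsedBytes() {
        return MEMORY.getNonHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("Heap used:     " + heapUsedBytes() / 1024 + " KiB");
        System.out.println("Non-heap used: " + nonHeapUsedBytes() / 1024 + " KiB");
    }
}
```

Even for a trivial program the non-heap figure is not zero, which is exactly the memory a limit equal to the maximum heap size would fail to account for.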
JVM and CPUs
Let’s take a short look at how the JVM adjusts to the number of processors/cores available on the node it is running on. There are actually a number of parameters which by default are initialized based on the core count:
- # of JIT compiler threads
- # of garbage collection threads
- # of threads in the common fork-join pool
So if the JVM is running on a 32-core node (and one did not override the defaults), the JVM will derive its number of garbage collection threads, JIT compiler threads, and so on from those 32 cores.
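The core count the JVM bases these defaults on can be queried directly, as the following sketch shows (the class name `CpuDefaults` is hypothetical). The size of the common fork-join pool is printed as one example of a value derived from it.

```java
import java.util.concurrent.ForkJoinPool;

final class CpuDefaults {

    /** Number of processors the JVM believes it can use. */
    static int visibleProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: " + visibleProcessors());
        // By default derived from the processor count, unless overridden via system property
        System.out.println("Common fork-join pool parallelism: "
            + ForkJoinPool.getCommonPoolParallelism());
    }
}
```

Running this once directly on a node and once inside a CPU-limited container on the same node is a quick way to check whether your JVM version takes the container limit into account.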
Next, let us look at how that works with containers.
JVM meets containers
Finally, we have all the tools available, and are ready to bring it all together!
So we have finished developing our JVM-based application, and now package it into a Docker image and test it locally on our notebook. All works great, so we deploy 10 instances of that container onto our production cluster. All of a sudden the application is being throttled and does not achieve the same performance as we have seen on our test system. And our production system is even this high-performance system with 64 cores…
What has happened? In order to allow multiple containers to run isolated side by side, we have limited each container to one CPU (or the equivalent ratio in CPU shares). Unfortunately, the JVM will see the overall number of cores on that node (64) and use that value to initialize the default thread counts we have seen earlier. As we started 10 instances, we end up with:
10 * 64 JIT compiler threads
10 * 64 garbage collection threads
10 * 64 …
And our application, being limited in the number of CPU cycles it can use, is mostly busy switching between different threads and cannot get any actual work done.
All of a sudden the promise of containers, “Package once, run anywhere”, seems violated…
Just to look at it from a different angle, let us compare containers against virtual machines once more, and look at where in each case the JVM is collecting its information (i.e., # cores, memory, …) from.
In JDK 7/8, it gets the core count from sysconf. That means that whenever I run it in a container, I am going to get the total number of processors available on the host system or, in the case of virtual machines, on the virtual system.
The same is true for default memory limits: the JVM will look at the host overall memory and use that for setting its defaults.
So we can say that the JVM ignores cgroups, and that causes problems as we have seen above.
If you have paid attention, you might wonder why namespaces are not coming to the rescue here. After all, we said they create container-specific views of the resources. Unfortunately, there is no CPU or memory namespace (also, namespaces usually have a slightly different goal), so a simple less /proc/meminfo from inside the container will still show you the overall memory on the host.
But, Java 9 supports containers!
With (Open)JDK 9, that changed. Java now supports Docker CPU and memory limits. Let us look at what “support” actually means.
The JVM will now consider cgroups memory limits if the following flags are specified:
In that case the maximum heap space will automatically (if not overridden) be derived from the limit specified by the cgroup. As we discussed earlier, the JVM uses memory besides the heap, so this will not protect users from the OOM killer removing their containers. But, especially given that the garbage collector becomes more aggressive as the heap fills up, this is already a great improvement.
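To check which maximum heap size the JVM actually settled on, a sketch like the following can be run inside the container, with and without the flags. To the best of my knowledge, the JDK 9 flags in question are the experimental options `-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap`; please verify this against your JDK version. The class name `HeapLimit` is made up for this example.

```java
final class HeapLimit {

    /** Maximum heap size the JVM will attempt to use, in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Converts bytes to whole mebibytes, rounding down. */
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Effective max heap: " + toMiB(maxHeapBytes()) + " MiB");
    }
}
```

If the reported value is far above the container's memory limit, the JVM is most likely still sizing its heap from the host's memory.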
With OpenJDK 9 the JVM will automatically detect cpusets and if set use the number of CPUs specified for initializing the default values discussed earlier.
Unfortunately, most users (and especially container orchestrators such as DC/OS) use CPU shares as the default CPU isolation. And with CPU shares you will still end up with the incorrect value for default parameters.
So what can I do?
Most important is probably to simply be aware of the issue and then decide whether it is an issue for your environment.
If not, great. If it is a problem, you should consider manually overriding the default parameters (e.g., at least -Xmx for memory and -XX:ParallelGCThreads and -XX:ConcGCThreads for CPU) according to your specific cgroup limits.
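As a rough illustration of deriving such overrides from container limits, consider the sketch below. The policy in it is entirely an assumption made for this example: the 75% heap fraction (leaving headroom for non-heap memory) and the "one concurrent GC thread per four parallel GC threads" ratio are hypothetical starting points, not recommendations from the JVM documentation, and should be tuned per application.

```java
import java.util.Arrays;
import java.util.List;

final class JvmFlagSketch {

    // Hypothetical policy: leave a quarter of the container limit for non-heap memory.
    static final double HEAP_FRACTION = 0.75;

    /** Builds JVM flags for a container with the given memory limit and CPU allowance. */
    static List<String> flagsFor(long memoryLimitBytes, int cpuLimit) {
        long heapMiB = (long) (memoryLimitBytes * HEAP_FRACTION) / (1024 * 1024);
        int parallelGcThreads = Math.max(1, cpuLimit);
        // Hypothetical ratio; always keep at least one concurrent GC thread
        int concGcThreads = Math.max(1, parallelGcThreads / 4);
        return Arrays.asList(
            "-Xmx" + heapMiB + "m",
            "-XX:ParallelGCThreads=" + parallelGcThreads,
            "-XX:ConcGCThreads=" + concGcThreads);
    }

    public static void main(String[] args) {
        // Example: a container limited to 512 MiB of memory and 2 CPUs
        System.out.println(String.join(" ", flagsFor(512L * 1024 * 1024, 2)));
    }
}
```

The resulting string would then be passed on the java command line of the containerized application, so that the JVM no longer falls back to defaults computed from the host.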
Also, OpenJDK 10 will improve the container support drastically: it includes, for example, support for CPU shares. For more details please check out this enhancement.