Troubleshooting CPU spike in a major trading application
This article by Ram Lakshmanan examines an example of how to troubleshoot a CPU spike within a major trading application in North America. See what was causing the issue, how it was fixed, and what lessons you can take away.
In this article we are going to discuss how we troubleshot a CPU spike problem that surfaced in a major trading application in North America. All of a sudden, this application’s CPU started to spike up to 100%.
In fact, this team didn’t make any new code deployment, any environmental changes, they didn’t flip any flag settings – but all of a sudden the CPU started to spike up.
We even validated whether there was an increase in the traffic volume, which attributed to this traffic volume spike. But there was no increase in the traffic volume as well.
This application was running on Java, Tomcat technology stack. We requested Site Reliability Engineering (SRE) team to capture the following two artifacts from the machine where this problem was happening:
1. top -H output
2. Thread dumps
Let’s see what these artifacts contain in this section.
1. top -H
Always CPU spikes are caused because of threads. So we had to isolate the threads which were causing this CPU spike. Apparently this application has hundreds of threads. From these hundreds of threads we need to identify the threads which are causing the CPU consumption to spike up? This is the first challenge.
This is where ‘top’ unix command line utility tool comes handy. Most of you might be familiar with ‘top’ unix command line utility. This tool shows all the processes that are running on the device. It also shows the CPU, memory consumed by each of those processes.
This tool has a secret ‘-H’ option, which a lot of engineers aren’t familiar with. It can be invoked like this:
$ top -H -p <PID>
Where PID is your application’s process Id. Apparently this application’s process ID is 31294. Thus the SRE team issued this command.
$ top -H -p 31294
When the ‘top’ tool is invoked with this ‘-H’ option, it will start to display all the threads that are running within that particular process. It will also display the amount of CPU and memory consumed by each thread within that process. Below is the output we got from this trading application:
As you can see from the “top -H” output, there is a thread with id ‘11956’ in the first row. Just this thread alone is consuming 60.9% of the CPU. Bingo!! This is the first win. Now using this ‘top -H’ option, we have identified the thread which is consuming a high amount of CPU.
Our next challenge is to identify the lines of code that this ‘11956’ thread is executing. This is where thread dumps come handy. Thread dumps shows all the threads that are running within the application and their code execution path (i.e. stack trace).
This blog highlights 8 different options to capture the thread dumps. You can use the option that is convenient to you. We used the ‘jstack’ tool which is part of the JDK to capture the thread dumps.
Best Practise: Capture at least 3 thread dumps,in a gap of 10 seconds between each thread dump.
Analyzing the data
Now we uploaded both the ‘top -H’ output and ‘thread dumps’ to fastThread tool. If you are not sure how to upload the ‘top -H’ output and thread dumps to fastThread tool, here are the instructions for it.
This tool has the ability to analyze thread dumps, and ‘top -H’ output and generate intuitive reports. This tool analyzed and generated this beautiful report.
This report contains ‘CPU | Memory’ section. This is the section where the tool marries top -H output and thread dump and provides a CPU & memory consumed by each thread in the application.
In the above screenshot we can see the “WebContainer:18” thread in the first row reported to be consuming “60.9%” CPU.
On the last column you can see the code execution path (i.e. stack trace) of this thread. You can see the tool now reporting the thread name (i.e. ‘WebContainer:18’) and the code path that it was executing (which wasn’t available to use when we saw the raw top -H output).
You will notice that ‘WebContainer: 18’ thread was executing the java.util.WeakHashMap#put() line of code. For the folks, who aren’t aware, HashMap is not a thread-safe implementation. When multiple threads invoke HashMap’s get() & put() methods concurrently then it can result in infinite looping.
When thread loops infinitely CPU consumption will start to skyrocket. That’s the exact problem happening in this application. Once WeakHashMap was replaced with ConcurrentHashMap the problem was resolved.
We hope you find this simple technique useful.