Why it’s difficult to find performance problems during pre-production tests
In this new series of articles on testing, Daniel Witkowski identifies common mistakes made during performance testing and demonstrates how to use widely available free open source tools to find response time problems before they reach production.
If your charter is delivering optimum application performance, whether for a human-facing web-scale application or a machine-to-machine system, you need performance metrics and methodologies you can trust. We frequently hear from clients whose internal tools and processes show acceptable performance while their end users report vastly different (and far worse) results.
Since the key to any human- or machine-facing application is predictable performance (and delivering a better end-user experience is a great way to drive more traffic and more revenue), we'd like to highlight some ways in which common tools and testing assumptions can deliver misleading results.
Once we’ve described the problem we’ll also discuss how you can deliver test metrics that more closely match real-world performance.
Introducing the “Coordinated Omission” Problem
Coordinated Omission is a problem that exists in almost all major load and performance testing systems. The easiest way to explain it is with a real-life example. Consider a simple load test scenario with multiple test threads, where each thread invokes some method and measures how long the call took. I will use JMeter, a free, open-source load-testing tool.
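The closed-loop measurement pattern described above can be sketched in a few lines. This is a simplified simulation with made-up numbers, not JMeter's internals: each thread times only the requests it actually manages to issue, so one long stall produces a single long sample instead of the hundreds of long waits real users would have experienced.

```java
import java.util.ArrayList;
import java.util.List;

public class ClosedLoopSampler {
    // Simulated service: normally 10 ms, but stalls once for 4 "seconds".
    static long serviceTimeMs(int requestNo) {
        return (requestNo == 5) ? 4000 : 10;
    }

    public static void main(String[] args) {
        List<Long> samples = new ArrayList<>();
        long clockMs = 0;                 // simulated wall clock
        int requestNo = 0;
        while (clockMs < 10_000) {        // 10-second test window
            long rt = serviceTimeMs(requestNo++);
            samples.add(rt);              // only completed requests are recorded
            clockMs += rt;                // next request starts AFTER this one finishes
        }
        // During the 4 s stall, ~400 would-be requests were never issued, so the
        // result set holds ONE 4000 ms sample instead of hundreds of long ones.
        System.out.println("samples recorded: " + samples.size());
        System.out.println("max sample (ms): "
                + samples.stream().mapToLong(Long::longValue).max().getAsLong());
    }
}
```

A perfectly healthy 10 ms service would have produced about 1,000 samples in this window; the stalled run records far fewer, and the missing samples are exactly the ones that would have pushed the percentiles up.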
Let's use the results shown below and assume we want to offer an SLA of 1 second to 90% of our users. Both the average and the 90th-percentile results look good, so everything should be fine.
But it's not. You should always have at least two independent ways of validating the results you obtain from any tool. In this case, even JMeter's default report can tell us that something is wrong.
Here we see that something went wrong at 17:50 and lasted until 17:54. Interestingly, each successive sample shows a smaller pause, and response times return to normal after 17:55. In reality there was a single four-minute pause during which nothing happened: no requests were processed and no data was sent to end users.
JMeter's Response Time Graph shows this well. If this pause really happened, then users arriving later would wait less: the first would wait four minutes, but somebody who started a request three minutes in would wait only one minute. But wait a minute: if there was a four-minute pause, why does the graph show a maximum of only 12.5 seconds?
This exposes another common mistake: the averaging period. In our case we picked 10 seconds, so each data point averages a large number of fast requests against only about ten slow ones, and the average is skewed. If we decrease the averaging period to 1 second, the maximum response time shown on the graph rises.
Unfortunately JMeter does not label the Y axis well here, so you will have to take my word that the maximum is actually 54 seconds, over four times higher than before. All we changed was how the results are presented, and the graph still shows a value roughly four times lower than the true maximum.
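The averaging-window effect is easy to reproduce with synthetic data. This sketch (made-up numbers, not the article's measurements) buckets the same sample set two ways and reports the worst per-bucket average; a single 4-second stall is diluted almost to invisibility in a 10-second window but stands out in 1-second windows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AveragingWindow {
    // Returns the highest per-window average response time.
    static double maxWindowAverage(List<long[]> samples, long windowMs) {
        Map<Long, List<Long>> buckets = new HashMap<>();
        for (long[] s : samples)            // s = {timestampMs, responseMs}
            buckets.computeIfAbsent(s[0] / windowMs, k -> new ArrayList<>()).add(s[1]);
        return buckets.values().stream()
                .mapToDouble(b -> b.stream().mapToLong(Long::longValue)
                        .average().getAsDouble())
                .max().getAsDouble();
    }

    public static void main(String[] args) {
        List<long[]> samples = new ArrayList<>();
        for (long t = 0; t < 10_000; t += 10)
            samples.add(new long[]{t, 10});        // steady 10 ms responses
        samples.add(new long[]{5_000, 4_000});     // one 4-second stall
        System.out.printf("10 s window max avg: %.1f ms%n",
                maxWindowAverage(samples, 10_000));
        System.out.printf("1 s window max avg:  %.1f ms%n",
                maxWindowAverage(samples, 1_000));
    }
}
```

The same data yields a worst-window average of about 14 ms at a 10-second granularity but about 50 ms at a 1-second granularity; nothing changed except the presentation.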
An easy way to spot these kinds of problems is to compare the same metrics across different measurement outputs. In this case we had a 232-second maximum in the first Aggregate Report but only 12 seconds on the graph. Conversely, the Aggregate Graph shows a 90th percentile below 1 second, while the graph shows the application was unresponsive 30% of the time. Although neither set of results is completely accurate, comparing the two gives a noticeably better picture.
The Coordinated Omission Problem describes situations where many potential samples are missing from your result set. In our case we had an average response time of 0.5 seconds and 10 concurrent JMeter threads, so each second should produce around 20 samples. The test ran for 17 minutes (1,020 seconds), so we should have roughly 20,400 samples (assuming ideal distribution). We recorded 15,241 samples, so we are missing over 5,000 samples, roughly 25% of the expected total. There is no easy way to fix this. You can refer to this forum thread which discusses several ways to improve the process.
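This sample-count sanity check is easy to automate. The sketch below redoes the arithmetic with the figures quoted above; note that it deliberately ignores ramp-up and think time, so the "expected" count is an upper-bound estimate, not an exact target.

```java
public class SampleCountCheck {
    public static void main(String[] args) {
        int threads = 10;                 // concurrent closed-loop test threads
        double avgResponseSec = 0.5;      // average response time per request
        int testDurationSec = 1_020;      // 17-minute test window

        // Each thread completes ~1/avgResponseSec requests per second.
        long expected = Math.round(threads / avgResponseSec * testDurationSec);
        long recorded = 15_241;           // samples actually in the report

        double missingPct = 100.0 * (expected - recorded) / expected;
        System.out.println("expected ~" + expected + " samples");
        System.out.printf("missing ~%.0f%% of samples%n", missingPct);
    }
}
```

If the recorded count falls well short of the estimate, the gap is a strong hint that the load generator stalled along with the system under test.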
How frequently can this really happen?
You might say that this doesn't happen often. Of course, missing a large share of your samples is not common; but even if 99% of your requests were fast, you might conclude that your system is performing well. Yet imagine your site's main page consists of 100 items: each user would then be exposed to that slowest 1% of requests on almost every page load! You can use Firefox's Extended Statusbar plugin to count the number of items on your pages, and a blog post by Gil Tene of Azul Systems shows that this 1% effect applies to almost all websites.
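A quick back-of-the-envelope check of that claim: if 1% of requests are slow and a page load issues 100 independent requests, most page views will include at least one slow request. (The independence assumption is a simplification for illustration.)

```java
public class SlowRequestExposure {
    public static void main(String[] args) {
        double slowFraction = 0.01;   // the slowest 1% of requests
        int requestsPerPage = 100;    // items fetched per page load
        // P(at least one slow request) = 1 - P(all 100 are fast)
        double pHit = 1 - Math.pow(1 - slowFraction, requestsPerPage);
        System.out.printf("P(page sees a slow request) = %.0f%%%n", pHit * 100);
    }
}
```

So a percentile that looks comfortably rare per request is anything but rare per page view.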
All load-test tools, whether commercial or open source, share this problem. If you are aware of it and understand your results, that's fine; but if you simply accept the reports these systems generate, you may miss many performance problems during your QA/stress-test sessions.
Another important thing to keep in mind while preparing reports is the averaging period. Usually the average response time is calculated over the whole testing window, so a few very long response times barely shift the result. In almost all cases, the median and high percentiles give a much more useful picture than the overall average. Let's use the same JMeter example and test application; this time there is no long pause to worry about. We'll start with the default averaging period JMeter suggests: 10 seconds, which sounds reasonable.
We see an average response time of around 500 ms with some peaks around 600-700 ms. Based on this you might adopt 700 ms as the SLA for your application; there is only one peak close to 700 ms, so we should be fine. Wrong! Let's see what happens with a 1-second averaging period.
Now we see more peaks reaching 1 second: the previous averaging period was too long, so we missed them. Let's see how it looks with 250 ms intervals.
Now we even see spikes above 1 second, alongside many requests at the 250 ms level. The shorter the sampling interval, the more meaningful the detail. Here, an average response time taken over the entire testing window would paint an inaccurate picture.
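The difference between the whole-window average, the median, and a high percentile is easy to demonstrate with synthetic numbers (made up for illustration, not taken from the test above): a handful of multi-second responses barely move the mean and do not move the median at all, while a high percentile exposes them immediately.

```java
import java.util.Arrays;

public class PercentilesVsAverage {
    // Nearest-rank percentile over a sorted array.
    static double percentile(long[] sorted, double p) {
        return sorted[(int) Math.ceil(p / 100.0 * sorted.length) - 1];
    }

    public static void main(String[] args) {
        long[] rt = new long[1000];
        Arrays.fill(rt, 500);                        // 1,000 "normal" 500 ms responses
        for (int i = 0; i < 5; i++) rt[i] = 30_000;  // 5 thirty-second stalls
        Arrays.sort(rt);

        double avg = Arrays.stream(rt).average().getAsDouble();
        System.out.printf("average: %.0f ms, median: %.0f ms, p99.9: %.0f ms%n",
                avg, percentile(rt, 50), percentile(rt, 99.9));
    }
}
```

Five users waited half a minute, yet the average moved only from 500 ms to about 650 ms and the median stayed at 500 ms; only the 99.9th percentile tells the real story.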
How to improve your measurements
What can you do to avoid such issues while performance testing your application? As a general rule, use multiple data sets and check whether they show similar results. Azul Systems' CTO Gil Tene created jHiccup, an open-source tool available on GitHub or from Azul's website.
It's a simple Java agent you can plug into your application. A single thread wakes up every millisecond, measures how long a trivial operation takes, and produces a histogram showing your JVM-level consistency profile. If this profile is similar to your application's, good. If you see something very different, it would be wise to dig deeper before deploying your application to a production environment.
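The core idea can be approximated in a few lines. This is a sketch of the mechanism, not jHiccup's actual implementation (jHiccup additionally records every observation into an HdrHistogram rather than keeping just the maximum):

```java
public class MiniHiccupMeter implements Runnable {
    private volatile long maxHiccupNanos = 0;

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try { Thread.sleep(1); } catch (InterruptedException e) { return; }
            // Anything beyond the requested 1 ms is platform "hiccup"
            // (GC pause, scheduler delay, power management, ...).
            long excess = System.nanoTime() - start - 1_000_000L;
            if (excess > maxHiccupNanos) maxHiccupNanos = excess;
        }
    }

    public long maxHiccupMillis() { return maxHiccupNanos / 1_000_000L; }

    public static void main(String[] args) throws InterruptedException {
        MiniHiccupMeter meter = new MiniHiccupMeter();
        Thread t = new Thread(meter, "hiccup-meter");
        t.setDaemon(true);
        t.start();
        Thread.sleep(2_000);              // observe the platform for 2 seconds
        System.out.println("max hiccup: " + meter.maxHiccupMillis() + " ms");
    }
}
```

Because the measuring thread does almost no work of its own, any delay it observes is attributable to the runtime and the operating system, not to application logic.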
To integrate jHiccup into your application add this line:
-javaagent:/jHiccup.2.0.2/jHiccup.jar="-d 5000 -I 1000 -l hiccuplog -c"
You can also use commas as separators if double quotes cause problems in your script files; in that case the line above would look like this (spaces replaced by commas, quotes dropped):
-javaagent:/jHiccup.2.0.2/jHiccup.jar=-d,5000,-I,1000,-l,hiccuplog,-c
I attached jHiccup to the test process and it looks like this:
In the top chart you can see the maximum response time measured in each sampling period (1 second by default). Starting about 440 seconds after application start, response time begins to grow, reaching a maximum of 233 seconds, which is similar to what we got with JMeter. The lower chart shows the response-time percentiles: the 90th percentile is at least 130 seconds. That should concern you when set against the 0.69 seconds reported by JMeter's Aggregate Report.
So, just by adding one simple monitoring tool to your application, you can validate the results generated by a load-test tool. Another useful feature of jHiccup is that it can export its data to a spreadsheet, so you can monitor your Java runtime latency over time. In our case, once we remove the one big pause and zoom into the 300 ms range, we see that JVM-level jitter sits at around 50 ms.
So if you intend to tune your application to meet an SLA below 50 ms, you will not be able to, because your JVM (or operating-system) jitter is already higher than 50 ms. It is always good to know your best case before you deploy your system to production. Do you know how your runtime platform behaves on your current production hardware? Do you know what kinds of pauses your system can produce even without an application running?
If you want to find out, there is a simple way: jHiccup can start an idle process and measure the latency of an empty JVM for a defined amount of time. All you need to do is run this line:
./jHiccup -d 4000 /usr/bin/java org.jhiccup.Idle -t 3600000
This will run for 1 hour and show you an output like this:
Here you can see that even an idle application on an idle system shows 50 ms peaks. Admittedly this is a laptop running Windows, so you might not expect much, but do you know how well your server does?
In this article I have shown some common mistakes people make when running load or stress tests. Using open-source tools like jHiccup, and comparing results across different data sets, should give you enough evidence to work out whether something is wrong with your summary data.
It is often difficult to find the root cause of a problem. However, knowing what can go wrong may prevent you from deploying a system with performance issues to the production environment.