Improving response time and latency measurements
In his last article, Daniel Witkowski described common mistakes that can be made during performance analysis. In this tutorial, Daniel shows us how to use available open source tools to improve the quality of your measurements.
Frequently, response time or latency measurements are done from inside the application. Even more often, they are done in the same method that should be monitored. This would occur in most simple scenarios. In our case let’s use this approach to measure the processing time of a simple task.
There is a small change that this code does in relation to what we think it should (calculate a sum from 1 to 800 mln) but this is outside the scope of this article, so let’s assume it does something that takes time. Also we would like to have some performance details, so let’s collect:
- Average processing time – as total processing time divided by total number of samples
- Maximum processing time
- Average Throughput – as total number of transactions divided by time of the test
Later I will describe why this set of parameters, picked for monitoring, was wrong however this is very frequently a set of data people collect during performance tests.
We would also like to see how the application profile changes when there is some more load, so we will try to create some garbage in the background so the JVM (or Garbage Collector) will have some work to do.
Assuming we learned something useful from the previous article on pre-production testing, let’s also enable GC logs and use jHiccup for monitoring. We’ll also use CMS Garbage Collector for some concurrent work (-XX:+UseConcMarkSweepGC).
This is a method that will print details about each cycle. As we use nanoTime() method for better (but not really accurate) measurements we divide it to present time in milliseconds and Throughput per second.
The line from tests will look like this:
Counter TotalTime AvgTime Max Throughput (tps)[48.04]
Base line without any additional load
We want to have a best-case scenario, so we run this code without any additional tools enabled. The results are presented in the graphs. Throughput is at fifty transactions per second level (Figure 1). Average time is below 20 ms and maximum time is at 30 ms (Figure 2). Processing time increases linearly (Figure 3). GC logs do not show any pause and jHiccup also confirms that there were no pauses over 15 ms (Figure 4). All data is consistent.
First run with a custom memory fragmentation tool
Now we have base line results ready, let’s create some simple code that will create some garbage. The simplest solution is to create a thread that is running all the time and allocating some objects in memory. When we open the GC log we see there are a lot of pauses (red line) and GC is running frequently (blue line shows memory utilization) but pauses are very consistent. Unless the application load is very unusual it is not something that simulates real user traffic (Figure 5). When we open jHiccup we also see GC decreases consistently, but we now have pauses almost all the time which is not good (Figure 6).
We can try to tune our GC Thread by adding some sleep time, randomizing sleep time and the amount of data allocated, but there is an easier way to do this. Azul Systems’ CTO Gil Tene created a Fragger tool that does exactly what we need. It is open sourced and available on GitHub so it’s really easy to use.
What is a Fragger?
Fragger is a heap fragmentation inducer, meant to induce compaction of the heap on a regular basis using a limited (and settable) amount of CPU and memory resources. The purpose of HeapFragger is (amongst other things) to aid application testers in inducing inevitable-but-rare garbage collection events, such that they would occur on a regular and more frequent and reliable basis. Doing so allows the characterization of system behaviour, such as response time envelope, within practical test cycle times.
HeapFragger works on the simple basis of repeatedly generating large sets of objects of a given size, pruning each set down to a much smaller remaining live set after it has been promoted, and increasing the object size between passes such that it becomes unlikely to fit in the areas freed up by objects released in a previous pass without some amount of compaction. HeapFragger ages object sets before pruning them down in order to bypass potential artificial early compaction by young generation collectors.
By the time enough passes are done so that aggregate space allocated by the passes roughly matches the heap size (although a much smaller percentage is actually alive), some level of compaction is likely or inevitable. HeapFragger’s resource consumption is completely tunable; it will throttle itself to a tuneable rate of allocation, and limit its heap footprint to a configurable level. When run with default settings, HeapFragger will occupy 10% of total heap space, and allocate objects at a rate of 50MB/sec.
Fragger works as a Java agent, so there is no code change required. Multiple Java agents can be added to the Java command line, so there is no risk when you already use some tools through the Java agent interface:
-javaagent:Fragger.jar="-a 1200 -s 512"
Fragger has some additional nice features as it ships with a pause detector and will output to system error stream a line with message when it detects pause. Threshold is configurable:
*** PauseDetector detected a 2252 ms pause at Thu May 21 15:12:33 CEST 2015 ***
Second run with Fragger
This time when we open the GC log we see much better memory profile (blue line). Also, we see a lot of pauses ranging from 0.1 second to 2.3 seconds (Figure 7).
When we compare this with jHiccup we see a similar pro – file. There are frequent pauses in 0.5 second ranges and at the end they increase to over 2 seconds (Figure 8). Since we were able to see a similar profile in both GC logs and jHiccup we expect that this will have a visible impact on data we measure in our application.
Throughput decreased and we see at the beginning similar Throughput as before (~50TPS) when it quickly drops to over 20TPS (Figure 9). After a while it goes over 40TPS but then drops below 40TPS. If we compare this graph to GC activity we can correlate more GC activity with less Throughput. Starting from that we can experiment with Fragger settings, heap sizing and Garbage Collector tuning to find the best configuration for our application. There is one more enhancement we should do while monitoring Throughput. Currently, we show “total historical Throughput” as it is calculated for total run time. This is good enough for batch like application when we do not care how Throughput changes over time as long as total work can be finished on time.
When we look at response time results they are rather surprising (Figure 10). They did not change at all! Even though we see 2.3 second pauses and jHiccup shows the 75 percentile is over 100 ms we still have the same results. We know they are wrong but why? If we only look at the response time profile we would assume we have a very stable and consistent application and we would not look to change anything at all! The problem we have here is related to the way we measure and the way the JVM works. In our case, whole processing is done in a loop that is usually not interrupted by GC threads, so GC waits until our calculation finishes and stops. This stop is not measured. We can add additional measurement to monitor time from last finish to current start but it will not have any business value. To help diagnose such problems, Azul Systems created an open source tool that helps to improve latency measurements. This tool is called LatencyUtils and it is available on GitHub.
Important note: When we work with transactional applications it is much better to calculate Throughput in some smaller time intervals. For the next test cycle we will keep all transactions including their timestamps and we will calculate Throughput in each interval. Let’s assume 1 second as a good interval as we have such a profile for jHiccup.
LatencyUtils from Azul Systems
The LatencyUtils package includes useful utilities for tracking latencies. Especially for common in-process recording scenarios, which can exhibit significant coordinated omission sensitivity without proper handling.
Latency tracking of in-process operations usually consists of simple time sampling around a tracked operation e.g. a database read operation, for which latency statistics are being developed may be surrounded by time measurement operation immediately before and after the operation is performed, with the difference in time recorded in some aggregate statistics gathering form (average, std. deviation, histogram,. etc.) This is later used to report on the experienced latency behaviour of the tracked operation.
The problem with this extremely common form of latency behaviour tracking is that whenever pauses occur in the system, latency statistics become significantly skewed toward falsely-optimistic values. This occurs in two key ways:
- When a pause occurs during a tracked operation, a single long recorded latency will appear in the recorded values, with no long latencies associated with any pending requests that may be stalled by the pause
- When a pause occurs outside of the tracked operation (and outside of the tracked time window) no long latency value would be recorded, even though any requested operation would be stalled by the pause
Important note: When measuring response time or transaction latency, use at least one external monitoring tool to confirm your measurements are correct.
The LatencyStats class is designed for simple, drop-in use as a latency behaviour recording object for common in-process latency recording and tracking situations. LatencyStats includes under-the-hood tracking and correction of pause effects, compensating for coordinated omission. It does so by using pluggable pause detectors and interval estimators that together with LatencyStats will transparently produce corrected histogram values for the recorded latency behaviour.
LatencyUtils in action
This time we used LatencyUtils on top of existing measurements in a really simple way. It is worth showing full output (I removed some higher percentiles for the smaller list) so everybody can see the details. It is visible that 50 percentile is at 19 ms and 77 percentile is over 50 ms. In our code we had maximum at 31 ms. Also it is interesting to see that total number of samples is over 13,000 and this means that over 3,000 samples were added by LatencyUtils to cover for uncounted pause time of the JVM.
Improved Throughput measurement
To improve Throughput measurement all results were stored in an ArrayList. After the test was finished, Throughput in each second was calculated. This is how Throughput changes when GC activity is visible (Figure 11).
It is clearly visible that GC pauses have a direct impact on application Throughput. Each time there is a GC event Throughput drops to 30TPS or below (from 50TPS level). Using this data it is possible to tune the application (or JVM) for some given goals – like sustained Throughput (in this case if we remove startup phase we can have guaranteed 20TPS and probably 30TPS with some small tuning) or application availability (based on Throughput value from GC logs). If we had not calculated Throughput in each second we would have believed we had sustained Throughput of 40TPS. This assumption would have at least 50% error.
In this article we have learnt a few important things:
- Response time measurement inside application code might have a very big measurement error
- Always measure important metrics with at least one external tool
- Use Java agents as a way to introduce additional tools to the application (used in jHiccup and Fragger)
With very little overhead and freely available tools it is really easy to extend your current monitoring framework through additional tools to confirm you do not miss a lot of transactions due to Coordinated Emission Problem.