A new tool open-sourced by DataStax

Detecting performance regressions with open source Hunter

Piotr Kołaczkowski

When developing new features of a complex, distributed, massively scalable software system, it is easy to accidentally degrade its performance, even when being very careful. Regular code reviews, static code analysis, and quality and performance testing are not enough. In this post, we show how Hunter can be used to regularly analyze thousands of performance metrics and reliably flag performance changes.

Introducing Hunter, a new tool open-sourced by DataStax to automate the analysis of performance test results. Hunter solves the scaling limitations inherent in visually inspecting results and the difficulty of using fixed detection thresholds to identify issues.

Performance is a key property of many software products, including database systems. Before releasing a new version of software to users, we want to make sure that its performance meets certain requirements. However, when developing new features of a complex, distributed, massively scalable software system, it is easy to accidentally degrade its performance, even when being very careful. Regular code reviews, static code analysis, and quality and performance testing are not enough.

In this post, we’ll discuss these problems in more detail and show how Hunter can be used to regularly analyze thousands of performance metrics and reliably flag performance changes.

SEE ALSO: What can toy building blocks teach developers about security best practices?

Problem 1: Visually inspecting performance metrics doesn’t scale

Internally, the DataStax team runs a very large suite of automated performance tests. These are executed regularly at programmed time intervals on the most up-to-date development versions of our products, which are fetched from the main branches of our source code repository. They can also be executed on demand by developers, for example, to verify performance on a topic branch.

Each performance test produces a set of metrics that capture the performance characteristics of some feature of the product under a given workload. For each test, we record several metrics such as query throughput, mean and tail response times, and various system metrics related to CPU, memory, and I/O. Once a performance test finishes, it saves the metric values to a time-series database, which can then be visualized using Grafana or similar data visualization software.

With the ability to easily inspect the results visually, an expert can usually tell right away by looking at graphs whether the numbers “look good” and can spot any anomalies such as unexpected value shifts or outliers. However, while we humans are evolutionarily optimized for recognizing patterns, this manual approach doesn’t scale when it comes to thousands of tests, each with numerous metrics, across many different workloads and environments. This approach also relies on expert knowledge about the system under test. So, a newcomer to a team might not be as effective at detecting performance problems as a seasoned expert who has worked on the product for years.

Problem 2: The utility of fixed threshold warnings depends on variables that can be hard to control

Performance changes can be easily detected by comparing the performance metrics of the last run with the previous run. Many developers use this approach when developing a new feature. They run the benchmarks for the original code, then switch to the topic branch, rerun the benchmarks and compare. If the difference is larger than a predefined fixed threshold (let’s say throughput dropped by 10%), they investigate why that happened.
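As a minimal sketch, the fixed-threshold comparison described above might look like the following. The `regressed` helper and the 10% default are purely illustrative, not part of Hunter’s API:

```python
def regressed(baseline: float, candidate: float, threshold: float = 0.10) -> bool:
    """Flag a regression when the candidate's throughput dropped by more
    than `threshold` relative to the baseline run."""
    return (baseline - candidate) / baseline > threshold

# A 15% throughput drop trips the 10% threshold; a 5% drop does not.
print(regressed(100_000, 85_000))  # True
print(regressed(100_000, 95_000))  # False
```

As the next paragraphs explain, the weakness of this approach is that the two data points it compares are both subject to noise.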

This approach to analyzing performance test results has a huge downside: it is very susceptible to noise. For example, even if you don’t change anything and run the same test twice in a row over the same codebase, you will probably obtain different results. The reason could be any number of things, including thermal throttling kicking in on the second run. Maybe your operating system suddenly became busier with something extremely important, like installing a newer version of the Java plugin. Or, if you’re running your tests in the cloud, you might hit a noisy neighbor.

In the case of Apache Cassandra, compaction or garbage collection can kick in at random times, which can slightly affect the performance results.

Under the best of circumstances, it can be difficult to control all the variables. For example, if your change detection automation relies on just two data points, and the difference threshold is low, the noise could trigger it far too frequently. The problem here is that when people start getting messages like, “We’ve found 14,094 performance regressions in your product” popping up every day, they begin to ignore those warnings. To avoid this, it’s important to set a threshold high enough to keep the number of false positives low. But how do you decide what that number should be?

Herein lies another problem: the threshold level you set depends on the amount of noise and the importance of the metric under observation. Some metrics can be very stable. For example, in a test that sends queries at a fixed rate to measure query latency, throughput can be a very stable metric. If it drops by even 1%, it could mean something went wrong with the execution of the test. But the 99.99th percentile of latency recorded in the very same test can be very noisy, simply because it is based on a much smaller number of data points. Hence, you probably don’t want to set a 1% relative threshold for that metric.

For some metrics, the exact threshold may not matter that much. For example, if your software is I/O bound and typical CPU use is about 1% of a core, most developers wouldn’t care whether it is 0.5%, 1%, or even 2%, as long as it stays reasonably low. This is a case where a fixed threshold warning makes much more sense.

The Solution: Using statistical change point detection to automate analysis of performance test results

To solve the limitations inherent in visual inspection of performance test results and the difficulties in using threshold detection limits to identify performance issues, Hunter automates the analysis of performance test results.

Fortunately, the problem of change point detection has been a topic of interest in computer programming for years and as a result, researchers have come up with some useful solutions that are based on statistical methods.

The basic idea behind statistical change point detection is that instead of comparing only two consecutive test runs, you can record all of the previous test runs and learn more about the properties of the system based on multiple data points. This way, you can estimate the amount of typical noise (variance) and adjust the threshold for triggering the change so that it happens only when an unusual pattern in data is observed. This approach significantly improves the signal-to-noise ratio from the performance regression detection system.
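A minimal illustration of this idea, using the standard deviation of historical runs as the noise estimate. The `unusual` helper and the 3-sigma default are our own sketch of the principle, not Hunter’s actual method:

```python
import statistics

def unusual(history: list[float], new_value: float, k: float = 3.0) -> bool:
    """Flag `new_value` only when it deviates from the historical mean by
    more than k standard deviations, i.e. beyond the typical noise."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(new_value - mean) > k * stdev

history = [100.0, 101.5, 99.2, 100.8, 99.9, 100.3]
print(unusual(history, 100.5))  # within normal noise  -> False
print(unusual(history, 92.0))   # far outside the noise -> True
```

Because the threshold adapts to the observed variance, a stable metric gets a tight band while a noisy one gets a wide band, which is exactly what the fixed-threshold approach cannot do.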

When developing Hunter we tried a few algorithms: Bayes Change Point Detection, the algorithms found in the Ruptures package, and E-Divisive. We tested them on performance data we had collected internally. We had previously observed a few regressions, as well as some performance improvements, and we checked whether the change detection algorithms would find them automatically.

Of the three algorithms, E-Divisive turned out to be the most accurate. It provided good sensitivity at catching real regressions but at the same time, it wasn’t very sensitive to noise. We particularly liked the fact that it didn’t require a huge number of data points. As few as 10 test runs were enough to get meaningful results. Based on our tests, we decided to use E-Divisive as the main change point detection algorithm for Hunter.

The E-Divisive algorithm works as follows:

  1. Find the data point that splits the time series into two subseries such that a measure of the difference between the left and right subseries (the empirical divergence) is maximized. This measure takes into account the sums of absolute differences between the metric values within each subseries, as well as the sums of differences between pairs of metric values, one taken from the left and the other from the right subseries. Generally, the less variance within each subseries and the greater the difference across subseries, the higher the measure of difference and the more likely the given candidate point wins.
  2. Compute a statistical test of significance of the difference between the left and the right subseries for the best candidate point. If the difference is statistically significant, add the candidate point to the result change point set and continue. Otherwise, return.
  3. Run the algorithm recursively on the left and on the right subseries.
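The split-finding step above can be sketched in simplified form. Real E-Divisive maximizes an energy-statistic divergence over all pairwise differences; the version below substitutes a plain difference of subseries means purely to show the shape of the search:

```python
def best_split(series: list[float]) -> tuple[int, float]:
    """Simplified stand-in for step 1: scan every candidate split point and
    keep the one that maximizes the difference between the two subseries.
    (Real E-Divisive uses an energy statistic, not a difference of means.)"""
    best_idx, best_score = 1, 0.0
    for i in range(1, len(series)):
        left, right = series[:i], series[i:]
        score = abs(sum(left) / len(left) - sum(right) / len(right))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score

# A series with a clear level shift between the 4th and 5th data points.
series = [10.0, 10.2, 9.9, 10.1, 12.0, 12.1, 11.9, 12.2]
idx, score = best_split(series)
print(idx)  # 4: the index where the level shifts
```

Steps 2 and 3 then test whether this best candidate is statistically significant and, if so, recurse into both halves.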

SEE ALSO: Who should keep software development safe and secure? It’s the engineering organization

Improvements to Hunter

Initially, when we built Hunter, we used the library published in MongoDB’s signal-processing-algorithms package without any changes, but we soon ran into a few minor issues with it.

The first one, though minor, was related to performance. Even though the number of test runs for a single test is not huge (on the order of hundreds), once multiplied by the number of metrics per test and the number of different tests, it easily gives us tens of thousands of data points. Still, running E-Divisive for a single metric of a single test with a hundred test runs took a few seconds, despite us using the version of the library partially written in C for better speed. In total, verifying a release could take many minutes, which isn’t ideal.

By profiling, we found that the most costly part of the E-Divisive implementation was significance testing. The default significance tester computed the significance level directly from the definition: it tried different random permutations of the data and checked how many times the difference measure on the permuted data set was higher than the one obtained on the original data. If randomly permuting the data yields better results too often, the result is not significant. Unfortunately, for this method to work reliably, a high number of permutations must be tried (the default was 100). Lowering the number of permutations improved performance, but lowered the reliability of the algorithm.
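The permutation-based significance test can be sketched like this. It is a simplified stand-in that uses a difference of means rather than the library’s actual divergence measure:

```python
import random

def mean_diff(series: list[float], split: int) -> float:
    """Absolute difference between the means of the two subseries."""
    left, right = series[:split], series[split:]
    return abs(sum(left) / len(left) - sum(right) / len(right))

def permutation_p_value(series: list[float], split: int, trials: int = 100) -> float:
    """Estimate significance by counting how often a random shuffle of the
    data produces a difference at least as large as the proposed split's.
    Every call to this costs `trials` full passes over the data."""
    observed = mean_diff(series, split)
    data = list(series)
    hits = 0
    for _ in range(trials):
        random.shuffle(data)
        if mean_diff(data, split) >= observed:
            hits += 1
    return hits / trials

random.seed(1)  # this randomness is exactly what made results non-repeatable
series = [10.0] * 10 + [20.0] * 10
print(permutation_p_value(series, 10))  # near 0: the split is significant
```

The 100-trial inner loop, run for every candidate split at every level of recursion, is what made this the hot spot in profiling.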

The second problem that some users complained about was the fact that the results were not repeatable. Sometimes Hunter flagged a test run as a change point, but in the next run, that very same test run wasn’t flagged. It turned out the randomness used in the significance tester was responsible for that effect.

Fortunately, the library authors made the significance tester pluggable, and we managed to replace it with a Student’s t-test. The t-test is probably not as powerful as the original permutation test, but it improved performance by a good order of magnitude and still delivered good enough results. Moreover, it is fully deterministic, so the results became repeatable.
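A pooled two-sample Student’s t statistic of the kind that can replace the permutation test can be computed with the standard library alone. This is a generic sketch with illustrative data, not Hunter’s exact code:

```python
import math
import statistics

def t_statistic(left: list[float], right: list[float]) -> float:
    """Pooled two-sample Student's t statistic (equal variances assumed).
    One deterministic formula instead of 100 random shuffles."""
    n1, n2 = len(left), len(right)
    m1, m2 = statistics.mean(left), statistics.mean(right)
    v1, v2 = statistics.variance(left), statistics.variance(right)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Illustrative data: a clear ~5% throughput drop between two subseries.
left = [100.1, 99.8, 100.3, 100.0, 99.9]
right = [95.2, 94.8, 95.1, 94.9, 95.3]
t = t_statistic(left, right)
print(abs(t) > 2.306)  # far above the 5% critical value for 8 dof -> True
```

Being a closed-form computation with no random sampling, it is both fast and fully repeatable, which is why it fixed the two problems at once.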

Another issue we found after we were able to collect more data was that some test runs that had earlier been flagged as large change points suddenly stopped being flagged. It wasn’t a functional regression in Hunter; it could still find the change points if we artificially reduced the number of data points analyzed, for example, to the last 80 test runs. A closer inspection revealed that the problematic change points were all regressions that had been fixed a few days after being introduced, soon enough that the number of regressing test runs was a lot smaller than the total number of test runs analyzed. Because E-Divisive always divides the series into two subseries, never three, the majority of data points on both sides of any split were “good” runs, without the regression. Additionally, the data points on one side of the split contained another not-yet-found change point, which increased the variance. This caused the difference to be insignificant, so the algorithm stopped too early.

We applied two additional fixes to improve the detection sensitivity for temporary regressions. The first fix split the data set into multiple windows, each containing fewer data points, and let E-Divisive analyze one window at a time. To allow finding regressions close to the edges of a window, the windows overlap, and additional care is taken not to report the same points multiple times.
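The overlapping-window split might look like the following sketch. The `windows` helper and its parameters are illustrative and assume `size > overlap`:

```python
def windows(n: int, size: int, overlap: int):
    """Yield (start, end) index ranges covering n data points with
    overlapping windows, so a change point near one window's edge still
    falls well inside a neighboring window. Assumes size > overlap."""
    start = 0
    while start < n:
        yield start, min(start + size, n)
        if start + size >= n:
            break
        start += size - overlap

# Each range would be handed to E-Divisive independently; points found in
# the overlap regions must be deduplicated before reporting.
print(list(windows(10, 4, 2)))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

With fewer points per window, a short-lived regression makes up a larger fraction of the window it falls in, which restores the significance that was lost in the full series.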

The second fix relaxed the significance test by increasing the allowed P-value, which enabled the algorithm to continue splitting even if the initial split candidate wasn’t very good. In order not to allow weak points into the result, we added a post-filtering phase where we reevaluate the P-values of the found change points. At this stage, the P-values can be computed much more accurately because we know all the split points. We reject the points with a P-value larger than the value requested by the user, as well as the points with a relative change smaller than a configurable threshold.
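Once the final P-values and relative changes are known, the post-filtering phase reduces to a simple filter over the candidate points. A hypothetical sketch, representing each candidate as a (p_value, relative_change) pair:

```python
def post_filter(candidates: list[tuple[float, float]],
                max_pvalue: float,
                min_rel_change: float) -> list[tuple[float, float]]:
    """Keep only change points that are both statistically significant
    and large enough to matter to the user."""
    return [cp for cp in candidates
            if cp[0] <= max_pvalue and abs(cp[1]) >= min_rel_change]

candidates = [
    (0.001, 0.12),   # significant and large: keep
    (0.20, 0.30),    # large but not significant: reject
    (0.001, 0.004),  # significant but tiny: reject
]
print(post_filter(candidates, max_pvalue=0.01, min_rel_change=0.01))
# [(0.001, 0.12)]
```

Splitting the work this way lets the search phase stay permissive while the final report stays strict.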

We’ll continue to improve Hunter as needed and encourage you to try it for your own performance testing. Hunter is licensed under the Apache License, Version 2.0, and you can find it on GitHub.

Author

Piotr Kołaczkowski

Piotr has been working as a Software Engineer at DataStax since 2012, where he has contributed many features, enhancements, and bug fixes to DataStax Enterprise, a highly scalable distributed database system based on Apache Cassandra. He was the initial author of the Spark Cassandra Connector. Recently he’s been busy learning Rust by developing a fast duplicate file finder.
