Trade-offs in benchmarking
Is it quality you’re looking to improve? Or performance? Before you decide what kind of benchmark your system needs, you need to understand the spectrum of cost and benefit trade-offs.
Benchmarking software is an important step in maturing a system. It is best to benchmark a system after correctness, usability, and reliability concerns have been addressed. In the typical lifetime of a system, emphasis is first placed on correctness of implementation, which is verified by unit, functional, and integration tests.
Some performance concerns can be addressed up front through careful design. Later, the emphasis shifts to the reliability and usability of the system, which are confirmed by monitoring and alerting on a system that has run in production for an extended period of time. At this point, the system is fully functional, produces correct results, and has the necessary set of features to be useful to the end client. It is only at this point that the remaining concerns about the performance of the system become relevant. At this stage, benchmarking the software helps us gain a better understanding of what improvement work is necessary to give the system a competitive edge.
The two kinds of benchmarks
There are two types of benchmarks one can create: performance and quality. Performance benchmarks generally measure latency and throughput. In other words, they answer the questions: “How fast can the system answer a query?”, “How many queries per second can it handle?”, and “How many concurrent queries can the system handle?” Quality benchmarks, on the other hand, address domain-specific concerns and do not translate well from one system to another. For instance, on a news website, a quality benchmark could be the total number of clicks, comments, and shares on each article. In contrast, a different website may track not only those properties but also what the users clicked on. This might happen because the website’s revenue depends on the number of referrals, rather than on how engaging a particular article was.
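As a rough illustration of what a performance benchmark measures, the sketch below times a stand-in query handler and reports median latency and throughput. The `handle_query` function and the workload of 1,000 identical queries are hypothetical placeholders, not part of any real system; a realistic benchmark would replay a representative workload against the actual service.

```python
import statistics
import time


def handle_query(query: str) -> str:
    """Hypothetical stand-in for the system under test."""
    return query.upper()


def run_benchmark(queries):
    """Run each query once; return per-query latencies (s) and throughput (queries/s)."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        handle_query(query)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    throughput = len(queries) / elapsed
    return latencies, throughput


# Assumed workload: 1,000 copies of the same query, purely for illustration.
latencies, throughput = run_benchmark(["example query"] * 1000)
print(f"median latency: {statistics.median(latencies) * 1e6:.1f} us")
print(f"throughput: {throughput:.0f} queries/s")
```

This single-threaded loop says nothing about concurrent queries; answering that question would need a load generator that issues requests in parallel.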
Revenue is also why benchmarks matter: the goal of a benchmark is to guide optimizations in the system and to define the performance goal. A good benchmark should be able to answer the question “How fast is fast enough?” That answer lets the company keep the users of the system happy while keeping the infrastructure bills as low as possible, instead of wasting money on unneeded hardware.
There’s a spectrum of cost and benefit trade-offs a benchmark designer should be aware of. Specialized benchmarks that use realistic workloads and model the production environment closely are expensive to set up. A common problem is that special infrastructure needs to exist to duplicate the production workload. Aggregation and verification of results are also very involved, as they require thorough analysis and the application of moderately sophisticated statistical techniques. On the other hand, micro-benchmarks are quick and easy to set up, but they often produce misleading results, since they might not measure a representative workload or set of functionality.
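To show how little setup a micro-benchmark needs, here is a minimal sketch using Python's standard `timeit` module. The `parse_line` function and its input are hypothetical; the point is that the measurement covers one small function on one fixed input, which is exactly why such numbers may not reflect production behaviour.

```python
import statistics
import timeit


def parse_line(line):
    """Hypothetical function under test."""
    return line.split(",")


CALLS_PER_REPEAT = 10_000

# Five repeats; each entry is the total time in seconds for one repeat.
totals = timeit.repeat(
    lambda: parse_line("a,b,c,d"), repeat=5, number=CALLS_PER_REPEAT
)
per_call = [total / CALLS_PER_REPEAT for total in totals]

print(f"best:   {min(per_call) * 1e9:.0f} ns per call")
print(f"median: {statistics.median(per_call) * 1e9:.0f} ns per call")
```

Reporting both the best and the median of several repeats gives a first hint of run-to-run noise, though it is far short of the statistical analysis a production-like benchmark calls for.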
Ask a question
To get started with designing a benchmark, it is helpful to pose a question for the system, e.g. “How fast does the page load for the user when they click to see the contents of their cart?” Pairing that question with the goal of the benchmark, e.g. “How fast does the page need to load for a pleasant user experience?”, gives the team guidance for their optimization work and helps determine when a milestone is reached.
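One way to make the question and the goal concrete is to compare a measured latency percentile against a target. The sketch below assumes a 95th-percentile target of 200 ms for the cart page; both the percentile and the threshold are illustrative assumptions, not recommendations, and the sample latencies are made up.

```python
import statistics


def meets_goal(latencies_ms, goal_ms, percentile=95):
    """Return (goal_met, observed) for the given latency percentile.

    statistics.quantiles with n=100 returns 99 cut points, so the
    95th percentile is the cut point at index 94.
    """
    cuts = statistics.quantiles(latencies_ms, n=100)
    observed = cuts[percentile - 1]
    return observed <= goal_ms, observed


# Made-up cart page load times in milliseconds.
samples_ms = [120, 135, 150, 140, 180, 210, 130, 125, 160, 145]
ok, p95 = meets_goal(samples_ms, goal_ms=200)
print(f"p95 = {p95:.0f} ms, goal met: {ok}")
```

A check like this turns “How fast is fast enough?” into a pass or fail signal the team can track as a milestone.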
Benchmarking is both an engineering and a business problem. Clearly defining the question and the goal for the benchmark helps use compute and engineering hours effectively. When designing a benchmark, it’s important to consider how much “bang for the buck” the team will get from the benchmarking work. Benchmarks with wide coverage of the system’s functionality and thorough analysis of the results are expensive to design and set up, but they also provide more confidence in the behaviour of the system. On the other hand, smaller benchmarks might answer narrow questions very well and help get the system closer to the goal much faster.