Regression Benchmarking with Simple Middleware Benchmarks

Lubomír Bulej 1,2, Tomáš Kalibera 1, Petr Tůma 1

1 Distributed Systems Research Group, Department of Software Engineering
Faculty of Mathematics and Physics, Charles University
Malostranské nám. 25, 118 00 Prague, Czech Republic
phone +420-221914267, fax +420-221914323

2 Institute of Computer Science, Czech Academy of Sciences
Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic
phone +420-266053831

{lubomir.bulej, tomas.kalibera, petr.tuma}@mff.cuni.cz

Abstract

The paper introduces the concept of regression benchmarking as a variant of regression testing focused on detecting performance regressions. Applying regression benchmarking in the area of middleware development, the paper explains how regression benchmarking differs from middleware benchmarking in general. On a real-world example of TAO, the paper shows why the existing benchmarks do not give results sufficient for regression benchmarking, and proposes techniques for detecting performance regressions using simple benchmarks.

1. Introduction

The development process of a software system is typically subject to a demand for a certain level of quality assurance. One of the approaches to meeting this demand is regression testing, where a suite of tests is built into the software system so that it can be tested regularly and potential regressions in its functionality detected and fixed. The complexity of middleware has led many middleware projects to adopt some form of regression testing, as evidenced by open source middleware projects such as CAROL [1], OpenORB [3], or TAO [7] with its distributed scoreboard [6]. Focusing on functionality, however, the regression testing of middleware tends to neglect the performance aspect of quality assurance.
With the notable exception of middleware projects that provide real-time or similar quality of service guarantees, performance is typically considered orthogonal to correct functionality and thus seen as a minor factor in quality assurance. This contrasts with the otherwise common use of middleware benchmarking to satisfy the obvious need to evaluate and compare the performance of the numerous implementations of middleware.

To remedy the existing neglect of the performance aspect in regression testing, we focus on incorporating middleware benchmarking into regression testing. Our experience from a series of middleware benchmarking projects [4][5][11][13] shows that systematic benchmarking of middleware can reveal performance bottlenecks and design problems as well as implementation errors. This leads us to believe that detailed, extensive and repetitive benchmarking can be used to find performance regressions in middleware, thus improving the overall process of quality assurance. For obvious reasons, we refer to such middleware performance evaluation as regression benchmarking.

In section 2 of the paper, we investigate the concept of regression benchmarking, explaining why and how it differs from benchmarking in general. Section 3 illustrates why the existing benchmarks do not give results sufficient for regression benchmarking, and proposes guidelines and techniques for detecting performance regressions using simple benchmarks. Section 4 outlines future work and concludes the paper.

To illustrate the individual points and proposed techniques, we use TAO [7] as a real-world example of complex and mature middleware. Two benchmarks are used throughout the paper. Benchmark A measures the duration of a remote method invocation with an input array of 1024 unsigned long values; benchmark B measures the duration of marshaling an input array of 1024 unsigned long values. All results were collected on a Dell Precision 340 Workstation with a Pentium 4 at 2.2 GHz and 512 MB RAM, running Linux 2.4.20 with GCC 2.96.

2. Regression Benchmarking

Regression benchmarking is a specialized application of benchmarking as a method of performance evaluation that is tightly integrated with the development process and fully automated. By integrating the regression benchmarking framework with the middleware development framework, new regression benchmarks can be added alongside new middleware features. The integration minimizes the cost of creating and maintaining regression benchmarks and has the added benefit of the benchmarks supporting the same platforms as the middleware.

The regression benchmarks must be fully automated so that they can run unattended. The requirement of automation concerns not only the execution of the benchmarks, but also the data acquisition and the results analysis.

The automated execution appears to be a simple task, with the existing remote access and scripting mechanisms being more than adequate for regression benchmarks.

The automated data acquisition must be able to recognize when the regression benchmark outputs data that describe the regular behavior of the middleware, as opposed to data distorted by the warm-up period at the start of the benchmark. Middleware benchmarking in general either uses long warm-up periods or expects the warm-up periods to be set by trial and error, neither of which is acceptable for regression benchmarks.

The automated results analysis remains a significant obstacle, as it must be able to detect performance regressions quickly and reliably.
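To make the warm-up requirement concrete, the steady state of a benchmark can be recognized with a simple windowed heuristic: slide a window over the stream of observed durations and declare the warm-up over once the means of two consecutive windows agree within a tolerance. The sketch below is only an illustration of the idea under assumed parameter values, not the mechanism used by any particular framework; the function name, window size and tolerance are all hypothetical.

```python
def steady_state_start(durations, window=100, tolerance=0.02):
    """Return the index at which the benchmark is considered warmed up,
    or None if no steady state is found in the data.

    The benchmark is considered warmed up once the means of two
    consecutive windows of observations differ by less than the given
    relative tolerance (an assumed heuristic, to be tuned per benchmark).
    """
    for start in range(0, len(durations) - 2 * window, window):
        first = durations[start:start + window]
        second = durations[start + window:start + 2 * window]
        mean_first = sum(first) / window
        mean_second = sum(second) / window
        # Relative difference between consecutive window means.
        if abs(mean_first - mean_second) <= tolerance * mean_first:
            return start + window
    return None
```

In practice the window size and tolerance would have to be chosen per benchmark, since a window shorter than the warm-up transient or a tolerance wider than the steady-state noise makes the heuristic declare a steady state too early.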
The longer the period between the occurrence and the detection of a performance regression, the more difficult it is to find the source of the regression and the more costly it is to fix. This requirement implies a need for benchmarks that are short enough to be run daily and precise enough to detect minuscule changes in performance. Detecting a performance regression is especially a problem in the case of creeping performance degradation, which consists of a sequence of individually negligible changes over a long period of time. Middleware benchmarking in general relies on manual results analysis, which is not feasible for the large quantities of results produced by regular runs of regression benchmarks.

3. Simple Regression Benchmarks

The existing middleware benchmarks can in general be divided into two broad groups based on their complexity. The group of relatively simple benchmarks covers benchmarks such as [4][5][6], where an isolated feature of the middleware is tested under an artificial workload. The intuitive justification for the simple benchmarks is that they provide precise results that have a straightforward interpretation. The group of relatively complex benchmarks covers benchmarks such as [8][10][12], where a set of features of the middleware is tested under a real-world workload. The intuitive justification for the complex benchmarks is that they provide overall results that have a direct relationship to real-world applications. The complex benchmarks do not lend themselves to regression benchmarking as readily as the simple benchmarks, because they are more expensive to run and because their results have a less straightforward interpretation. In this paper, we focus on the simple benchmarks.

Figure 1 Results of consecutive executions of the remote method invocation benchmark (benchmark A).

3.1. Existing Simple Benchmarks

A representative example of the simple middleware benchmarks is a benchmark that measures the duration of a remote method invocation by averaging the durations of several remote method invocations. The results of such a benchmark, similar to the results from [6], are on figure 1.

The results on figure 1 have a relative variation of 3.8%. Similar results in [6] suggest that consecutive executions of the same simple benchmark will typically yield results with a relative variation of at least several percent. Using a simple comparison of the results for regression benchmarking would therefore imply a need to ignore differences of a few percent and only identify differences of tens of percent as performance regressions. For the simple benchmarks, that is clearly too low a resolution.

3.2. Minimizing Result Distortion

The difference in the results of consecutive executions of the same simple benchmark can be partially attributed to interference from the operating system, consisting especially of involuntary context switches and device interrupts. Figure 2 shows the interference on a benchmark that measures the duration of a remote method invocation, by marking those observations that were interrupted by involuntary context switches and device interrupts. The results were obtained by modifying the operating system to provide the necessary information.

Figure 2 Interrupted observations of the remote method invocation benchmark (benchmark A).

The results on figure 2 attribute only some of the exceptional observations to the involuntary context switches and device interrupts. The fact that the operating system does not make the information about such interference readily available disqualifies filtering of the exceptional observations as a method of minimizing the result distortion. An alternative method of minimizing the result distortion is keeping the measured operation duration below the period of the interference, thus making the chance of the interference happening during the measured operation reasonably small. We can then express the results using a median rather than an average of the observations, as the median is a more robust estimator that is not affected by a small number of exceptional observations.

For the remote method invocation, keeping the measured operation duration below the period of the interference means measuring the low-level operations that make up the remote method invocation, such as the marshaling and unmarshaling operations, data conversion operations and dispatching in various stages of the invocation, rather than the entire remote method invocation. The duration of the low-level operations ranges from tens to hundreds of microseconds, which is well below the period of the operating system interference, ranging from tens to hundreds of milliseconds. Assuming the remote method invocation is made up of n similar low-level operations, the duration of the low-level operations will have a relative variation roughly √n times higher than the relative variation of the remote method invocation. This does not imply a decrease in the resolution of the regression benchmark in terms of section 3.1, though, for it is the duration of the remote method invocation, rather than the duration of the low-level operations, that the resolution should be related to.

3.3. Collecting Enough Observations

The reliability of a simple benchmark that reports a result calculated from several observations depends on the number of observations. When estimating the median of the operation duration, we can assume the observed durations to be independent identically distributed observations and then estimate the median using order statistics. Figure 3 shows the relative precision of the median depending on the number of observations, based on an estimate of the confidence interval of the median at the 99% confidence level.

Figure 3 Relative precision of the median for the marshaling benchmark (benchmark B).

Alternatively, we can determine the minimal number of observations necessary to ensure a precise estimate of the median using the quantile precision requirement proposed by Chen and Kelton in [2]. The requirement uses a dimensionless maximum proportion confidence half-width instead of the usual maximum absolute or relative confidence half-width. The required number of observations n_p for the fixed-sample-size procedure of estimating the p quantile of an independent identically distributed sequence is:

n_p = z_{1-α/2}² · p(1-p) / ε²

where z_{1-α/2} is the (1-α/2) quantile of the normal distribution, ε is the maximum proportion half-width of the confidence interval, and (1-α) is the confidence level. For a 95% confidence that the median estimator has no more than ε = 0.005 deviation from the true but unknown median, we obtain a result of n_p = 38416. A choice of n_p = 65536 borders on the 99% confidence level and is acceptable for a simple benchmark.

3.4. Still Different Results

After minimizing the interference and collecting the necessary number of observations, consecutive executions of the same simple benchmark will still yield different results. Figure 4 shows the differences on a benchmark that measures the duration of a low-level marshaling operation of a remote method invocation. The results on figure 4 have a relative variation of 8.7% with an average of 3.4 µs, compared to the relative variation of 3.8% with an average of 206 µs on figure 1.

Figure 4 Median results of consecutive executions of the marshaling benchmark (benchmark B).

Keeping in mind that it is the duration of the remote method invocation, rather than the duration of the low-level marshaling operation, that the regression benchmarking ultimately monitors, the results on figure 4 provide about 25 times better resolution in terms of section 3.1 than the results on figure 1. On the other hand, the high relative variation makes the results on figure 4 even more unsuitable for a simple comparison than the results on figure 1.

Compared to the differences in the results of consecutive executions presented on figure 1, the differences in the results on figure 4 are less due to the results being distorted and unreliable, and more due to the benchmark not having enough control over the initial state of the system to make the results repeatable across executions. For complex benchmarks, things such as the physical placement of files in a filesystem or of records in a database can impact the results. Simple benchmarks deal with smaller measured operation durations, and therefore even things such as the selection of physical memory pages, coupled with limited memory cache associativity, can become an issue. Note that this makes the accepted practice of comparing results that are averaged observations at least questionable.

To compare the results correctly, it is necessary to treat each result of a benchmark execution as an observation of a random variable. We can assume the results of several consecutive benchmark executions to be a set of independent identically distributed observations with a normal distribution. Under this assumption, the sets of results from multiple benchmark executions can be compared using the standard statistical tests for comparing samples from two or more normal populations. We use the two-sample F-test to validate the equal variance assumption, and the unmatched two-sample t-test to compare the averages of a pair of benchmark result sets.
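The comparison just described can be sketched with the two test statistics computed directly from their textbook definitions. This is a minimal, stdlib-only illustration, not our framework's implementation; the function names and the synthetic data are assumptions, and the critical values must come from standard F and t tables for the appropriate degrees of freedom (the 2.712 below is approximately the two-sided 1% t critical value for 20 + 20 - 2 = 38 degrees of freedom).

```python
import math
import statistics

def f_statistic(a, b):
    """Two-sample F statistic: ratio of the larger sample variance to the
    smaller, used to validate the equal variance assumption."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)

def t_statistic(a, b):
    """Unmatched two-sample t statistic with pooled variance
    (valid when the F-test does not reject equal variances)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a) +
              (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled * (1 / na + 1 / nb))

# Synthetic per-execution averages for two builds: same spread,
# but the second build is 0.2 microseconds slower on average.
old = [3.40 + 0.01 * (-1) ** i for i in range(20)]
new = [3.60 + 0.01 * (-1) ** i for i in range(20)]

T_CRITICAL = 2.712  # approx. two-sided 1% critical value, 38 degrees of freedom
change_detected = abs(t_statistic(old, new)) > T_CRITICAL
```

Each element of `old` and `new` is one benchmark execution treated as a single observation, which is exactly the point of the technique: the test compares the distributions of per-execution results rather than a pair of grand averages.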
Figure 5 Results of the marshaling benchmark (benchmark B) for consecutive build versions.

The results of the technique applied to a real-world example are illustrated on figure 5. The example evaluates the development progress of the marshaling mechanism in TAO for a range of build versions dated from May 19, 2003 to October 27, 2003, with a step of one week. The technique was used to compare the sets of results from multiple benchmark executions for pairs of consecutive build versions. Bold lines mark the two performance changes that were detected in the development process. Note that it would be impossible to detect the performance changes by comparing the results of the individual executions.

4. Conclusion

Regression benchmarking is a variant of regression testing focused on detecting performance regressions. We introduce regression benchmarking in the area of middleware benchmarking, explaining how middleware regression benchmarking differs from middleware benchmarking in general. Selecting the broad group of relatively simple benchmarks, we illustrate why the existing benchmarks do not give results sufficient for regression benchmarking. As the next step, we present a set of guidelines on minimizing result distortion and collecting enough observations, and propose a technique for detecting performance regressions using simple benchmarks that adhere to the presented guidelines. Importantly, the technique differs from the accepted practice of comparing results that are averaged observations, which is generally incorrect. We demonstrate this on a real-world example of middleware benchmarking.

Although the guidelines and techniques in the paper are sufficient to conduct regression benchmarking with simple middleware benchmarks, further work is necessary to achieve the same results for relatively complex benchmarks.
Complex middleware benchmarks are indispensable because they exercise multiple functions of the middleware concurrently and therefore measure the effects of complex interactions among the functions. Unfortunately, the complex benchmarks are more expensive to run than the simple benchmarks, and their results have a less straightforward interpretation, especially when expressed as a single throughput value in operations per second, as in [8][10][12]. We are currently investigating the use of clustering [9] to separate the results of complex benchmarks into groups of values that lend themselves better to automated analysis.

Our work on regression benchmarking is available at http://nenya.ms.mff.cuni.cz. The current status includes a number of benchmarks and limited support for automated execution and data acquisition; support for automated results analysis is being added as the required techniques are developed.
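The clustering idea can be illustrated with a plain k-means iteration on scalar results. This is only a sketch of the general technique, not the continuous k-means of [9], which refines the same iteration to avoid full reassignment passes; the function name, parameters and sample data are assumptions for illustration.

```python
import random

def kmeans_1d(values, k, iterations=100, seed=0):
    """Plain (Lloyd-style) k-means on scalar benchmark results.

    Returns the k cluster centroids in ascending order, so that results of
    a complex benchmark separate into groups of similar values.
    """
    rng = random.Random(seed)
    # Pick k distinct results as the initial centroids.
    centroids = rng.sample(values, k)
    for _ in range(iterations):
        # Assign every value to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster; keep empty clusters in place.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable, the clustering has converged
        centroids = new_centroids
    return sorted(centroids)
```

For example, throughput results that alternate between two modes of behavior, such as [1.0, 1.1, 0.9, 10.0, 10.1, 9.9], separate with k = 2 into groups around 1.0 and 10.0, and each group can then be analyzed with the per-group statistics described in section 3.4.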
5. Acknowledgements

The authors would like to remember Adam Buble for his invaluable contribution to our efforts, and to thank Franklin Webber for his insightful review of this paper. This work is partially sponsored by the Grant Agency of the Czech Republic grant 102/03/0672.

6. References

[1] CAROL: Common Architecture for RMI ObjectWeb Layer, http://carol.objectweb.org.
[2] E. J. Chen, W. D. Kelton: Simulation-based Estimation of Quantiles, Winter Simulation Conference 99, USA, 1999.
[3] The Community OpenORB Project, http://openorb.sourceforge.net.
[4] Distributed Systems Research Group: Open CORBA Benchmarking Project, http://nenya.ms.mff.cuni.cz/~bench.
[5] Distributed Systems Research Group: Vendor CORBA Benchmarking Project, http://nenya.ms.mff.cuni.cz/projects.phtml?p=cbench.
[6] DOC Group: TAO Performance Scoreboard, http://www.dre.vanderbilt.edu/stats/performance.shtml.
[7] DOC Group: The ACE Orb, http://www.dre.vanderbilt.edu/tao.
[8] ECperf Specification, Version 1.1, Sun Microsystems, 2002, http://www.theserverside.com/ecperf.
[9] V. Faber: Clustering and the Continuous k-means Algorithm, Los Alamos Science, No. 22, 1994.
[10] ObjectWeb Consortium: RUBiS: Rice University Bidding System, http://rubis.objectweb.org.
[11] F. Plášil, P. Tůma, A. Buble: Charles University Response to the Benchmark RFI, OMG bench/98-10-04, 1998.
[12] Transaction Processing Performance Council: TPC Benchmark Web Commerce Specification 1.8, 2002, http://www.tpc.org.
[13] P. Tůma, A. Buble: Open CORBA Benchmarking, SPECTS 01, USA, 2001.