Regression Benchmarking with Simple Middleware Benchmarks


Lubomír Bulej 1,2, Tomáš Kalibera 1, Petr Tůma 1

1 Distributed Systems Research Group, Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Malostranské nám. 25, 118 00 Prague, Czech Republic, phone +420-221914267, fax +420-221914323
2 Institute of Computer Science, Czech Academy of Sciences, Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic, phone +420-266053831
{lubomir.bulej, tomas.kalibera, petr.tuma}@mff.cuni.cz

Abstract

The paper introduces the concept of regression benchmarking as a variant of regression testing focused on detecting performance regressions. Applying regression benchmarking to middleware development, the paper explains how regression benchmarking differs from middleware benchmarking in general. On a real-world example of TAO, the paper shows why the existing benchmarks do not give results sufficient for regression benchmarking, and proposes techniques for detecting performance regressions using simple benchmarks.

1. Introduction

The development process of a software system is typically subject to a demand for a certain level of quality assurance. One of the approaches to meeting this demand is regression testing, where a suite of tests is built into the software system so that it can be regularly tested and potential regressions in its functionality detected and fixed. The complexity of middleware has led many middleware projects to adopt some form of regression testing, as evidenced by open source middleware projects such as CAROL [1], OpenORB [3], or TAO [7] with its distributed scoreboard [6].

Focusing on functionality, however, the regression testing of middleware tends to neglect the performance aspect of quality assurance. With the notable exception of middleware projects that provide real-time or similar quality of service guarantees, performance is typically considered orthogonal to correct functionality and thus seen as a minor factor in quality assurance. This contrasts with the otherwise common use of middleware benchmarking to satisfy the obvious need to evaluate and compare the performance of numerous middleware implementations.

To remedy the existing neglect of the performance aspect in regression testing, we focus on incorporating middleware benchmarking into regression testing. Our experience from a series of middleware benchmarking projects [4][5][11][13] shows that systematic benchmarking of middleware can reveal performance bottlenecks and design problems as well as implementation errors. This leads us to believe that detailed, extensive and repetitive benchmarking can be used for finding performance regressions in middleware, thus improving the overall process of quality assurance. For obvious reasons, we refer to such middleware performance evaluation as regression benchmarking.

In section 2 of the paper, we investigate the concept of regression benchmarking, explaining why and how it differs from benchmarking in general. Section 3 illustrates why the existing benchmarks do not give results sufficient for regression benchmarking, and proposes guidelines and techniques for detecting performance regressions using simple benchmarks. Section 4 outlines future work and concludes the paper.

To illustrate the individual points and proposed techniques, we use TAO [7] as a real-world example of complex and mature middleware. Two benchmarks are used throughout the paper. Benchmark A measures the duration of a remote method invocation with an input array of 1024 unsigned long values; benchmark B measures the duration of marshaling an input array of 1024 unsigned long values. All results were collected on a Dell Precision 340 workstation with a Pentium 4 at 2.2 GHz and 512 MB RAM, running Linux 2.4.20 with GCC 2.96.

2. Regression Benchmarking

Regression benchmarking is a specialized application of benchmarking as a method of performance evaluation that is tightly integrated with the development process and fully automated. By integrating the regression benchmarking framework with the middleware development framework, new regression benchmarks can be added alongside new middleware features. The integration minimizes the cost of creating and maintaining regression benchmarks and has the added benefit of the benchmarks supporting the same platforms as the middleware.

The regression benchmarks must be fully automated so that they can run unattended. The requirement of automation concerns not only the execution of the benchmarks, but also the data acquisition and the results analysis.

The automated execution appears to be a simple task, with the existing remote access and scripting mechanisms being more than adequate for regression benchmarks.

The automated data acquisition must be able to recognize when the regression benchmark outputs data that describe the regular behavior of the middleware, as opposed to data distorted by the warm-up period at the start of the benchmark. Middleware benchmarking in general either uses long warm-up periods or expects the warm-up periods to be set by trial and error, neither of which is acceptable for regression benchmarks.

The automated results analysis remains a significant obstacle, as it must be able to detect performance regressions quickly and reliably. The longer the period between the occurrence and the detection of a performance regression, the more difficult it is to find the source of the regression and the more costly it is to fix. This requirement implies a need for benchmarks that are short enough to be run daily and precise enough to detect minuscule changes in performance. Detecting a performance regression is especially a problem for creeping performance degradation, which consists of a sequence of individually negligible changes over a long period of time. Middleware benchmarking in general relies on manual results analysis, which is not feasible for the large quantities of results produced by regular runs of regression benchmarks.
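The paper does not prescribe how the automated data acquisition should recognize the end of the warm-up period. Purely as an illustration of one possible approach, the following Python sketch discards leading observations until the medians of consecutive batches stabilize; the batch size and tolerance are arbitrary assumptions of ours, not values from the paper.

    import statistics

    def drop_warmup(samples, batch_size=1000, tolerance=0.02):
        """Return the observations that follow the warm-up period.

        The run is split into consecutive batches; the warm-up is considered
        over once the median of a batch stays within `tolerance` (relative)
        of the median of the following batch. Thresholds are illustrative.
        """
        batches = [samples[i:i + batch_size]
                   for i in range(0, len(samples), batch_size)]
        for i in range(len(batches) - 1):
            m_now = statistics.median(batches[i])
            m_next = statistics.median(batches[i + 1])
            if abs(m_now - m_next) <= tolerance * m_next:
                return [x for batch in batches[i:] for x in batch]
        return []  # steady state not reached; flag the run for inspection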
3. Simple Regression Benchmarks

The existing middleware benchmarks can be divided into two broad groups based on their complexity. The group of relatively simple benchmarks covers benchmarks such as [4][5][6], where an isolated feature of the middleware is tested under an artificial workload. The intuitive justification for the simple benchmarks is that they provide precise results that have a straightforward interpretation. The group of relatively complex benchmarks covers benchmarks such as [8][10][12], where a set of middleware features is tested under a real-world workload. The intuitive justification for the complex benchmarks is that they provide overall results that have a direct relationship to real-world applications.

The complex benchmarks do not lend themselves to regression benchmarking as readily as the simple benchmarks, because they are more expensive to run and because their results have a less straightforward interpretation. In this paper, we focus on the simple benchmarks.

Figure 1: Results of consecutive executions of the remote method invocation benchmark (benchmark A).

3.1. Existing Simple Benchmarks

A representative example of the simple middleware benchmarks is a benchmark that measures the duration of a remote method invocation by averaging the durations of several remote method invocations. The results of such a benchmark, similar to the results from [6], are on figure 1.

The results on figure 1 have a relative variation of 3.8%. Similar results in [6] suggest that consecutive executions of the same simple benchmark will typically yield results with a relative variation of at least a few percent. Using a simple comparison of the results for regression benchmarking would therefore imply ignoring differences of a few percent and only identifying differences of tens of percent as performance regressions. For the simple benchmarks, that is clearly too low a resolution.
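For illustration, a minimal sketch of such a benchmark driver is given below. The `invoke` callable is a hypothetical stand-in for the measured remote call, and computing the relative variation as the standard deviation over the mean of the per-execution averages is our assumption; the paper does not define the term formally.

    import time
    import statistics

    def benchmark_average(invoke, iterations=10000):
        """One benchmark execution: average duration of `iterations` invocations."""
        start = time.perf_counter()
        for _ in range(iterations):
            invoke()
        return (time.perf_counter() - start) / iterations

    def relative_variation(results):
        """Spread of per-execution averages, as a percentage of their mean."""
        return 100.0 * statistics.stdev(results) / statistics.fmean(results)

    # Consecutive executions of the same benchmark; the spread of their
    # averages is the relative variation discussed in the text.
    # results = [benchmark_average(invoke) for _ in range(30)]
    # print(relative_variation(results))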

3.2. Minimizing Result Distortion

The difference in the results of consecutive executions of the same simple benchmark can be partially attributed to interference from the operating system, consisting especially of involuntary context switches and device interrupts. Figure 2 shows the interference on a benchmark that measures the duration of a remote method invocation, by marking those observations that were interrupted by involuntary context switches and device interrupts. The results were obtained by modifying the operating system to provide the necessary information.

Figure 2: Interrupted observations of the remote method invocation benchmark (benchmark A).

The results on figure 2 attribute only some of the exceptional observations to the involuntary context switches and device interrupts. The fact that the operating system does not make the information about such interference readily available disqualifies filtering of the exceptional observations as a method of minimizing the result distortion.

An alternative method of minimizing the result distortion is keeping the measured operation duration below the period of the interference, thus making the chance of interference occurring during the measured operation reasonably small. We can then express the results using a median rather than an average of the observations, as the median is a more robust estimator that is not affected by a small number of exceptional observations.

For the remote method invocation, keeping the measured operation duration below the period of the interference means measuring the low-level operations that make up the remote method invocation, such as the marshaling and unmarshaling operations, data conversion operations and dispatching in various stages of the invocation, rather than the entire remote method invocation. The duration of the low-level operations ranges from tens to hundreds of microseconds, which is well below the period of the operating system interference, ranging from tens to hundreds of milliseconds.

Assuming the remote method invocation is made up of n similar low-level operations, the duration of the low-level operations will have a relative variation roughly √n times higher than the relative variation of the remote method invocation. This does not imply a decrease in the resolution of the regression benchmark in terms of section 3.1, though, because it is the duration of the remote method invocation rather than the duration of the low-level operations that the resolution should be related to.
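The robustness of the median against a few interference-inflated observations can be seen on a toy example; the values below are invented purely for illustration.

    import statistics

    # Illustrative only: durations of a low-level operation in microseconds,
    # with two observations inflated by operating-system interference.
    durations = [3.3, 3.4, 3.5, 3.4, 3.3, 3.4, 180.0, 3.5, 3.4, 95.0]

    print(statistics.mean(durations))    # ~30 us, dominated by the two outliers
    print(statistics.median(durations))  # 3.4 us, unaffected by a few outliers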
3.3. Collecting Enough Observations

The reliability of a simple benchmark that reports a result calculated from several observations depends on the number of observations. When estimating the median of the operation duration, we can assume the observed durations to be independent identically distributed observations and then estimate the median using order statistics. Figure 3 shows the relative precision of the median depending on the number of observations, based on an estimate of the confidence interval of the median at the 99% confidence level.

Figure 3: Relative precision of the median for the marshaling benchmark (benchmark B).

Alternatively, we can determine the minimal number of observations necessary to ensure a precise estimate of the median using the quantile precision requirement proposed by Chen and Kelton in [2]. The requirement uses a dimensionless maximum proportion confidence half-width instead of the usual maximum absolute or relative confidence half-width. The required number of observations n_p for the fixed-sample-size procedure of estimating the p quantile of an independent identically distributed sequence is:

n_p \ge \frac{z_{1-\alpha/2}^{2} \; p(1-p)}{\varepsilon^{2}}

where z_{1-α/2} is the (1-α/2) quantile of the standard normal distribution, ε is the maximum proportion half-width of the confidence interval, and (1-α) is the confidence level. For a 95% confidence that the median estimator has no more than ε = 0.005 deviation from the true but unknown median, we obtain n_p ≥ 38416. A choice of n_p = 65536 borders on the 99% confidence level and is acceptable for a simple benchmark.
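Under the formula above, the required sample size is straightforward to compute. The sketch below reproduces the figures quoted in the text; the helper name is ours.

    from math import ceil
    from statistics import NormalDist

    def required_observations(p=0.5, epsilon=0.005, confidence=0.95):
        """Sample size n_p for estimating the p quantile to within a maximum
        proportion half-width epsilon at the given confidence level, using
        the formula reconstructed in the text (after Chen and Kelton [2])."""
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        return ceil(z * z * p * (1 - p) / (epsilon * epsilon))

    print(required_observations())                 # 38415 (38416 with z rounded to 1.96)
    print(required_observations(confidence=0.99))  # 66349; 65536 borders on the 99% level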

3.4. Still Different Results

After minimizing the interference and collecting the necessary number of observations, consecutive executions of the same simple benchmark will still yield different results. Figure 4 shows the differences on a benchmark that measures the duration of a low-level marshaling operation of a remote method invocation.

Figure 4: Median results of consecutive executions of the marshaling benchmark (benchmark B).

The results on figure 4 have a relative variation of 8.7% with an average of 3.4 µs, compared to the relative variation of 3.8% with an average of 206 µs on figure 1. Keeping in mind that it is the duration of the remote method invocation rather than the duration of the low-level marshaling operation that the regression benchmarking ultimately monitors, the results on figure 4 provide about 25 times better resolution in terms of section 3.1 than the results on figure 1. On the other hand, the high relative variation makes the results on figure 4 even less suitable for a simple comparison than the results on figure 1.

Compared to the differences in the results of consecutive executions presented on figure 1, the differences in the results on figure 4 are less due to the results being distorted and unreliable, and more due to the benchmark not having enough control over the initial state of the system to make the results repeatable across executions. For complex benchmarks, things such as the physical placement of files in a filesystem or records in a database can impact the results. Simple benchmarks deal with smaller measured operation durations, so even things such as the selection of physical memory pages coupled with limited memory cache associativity can become an issue. Note that this makes the accepted practice of comparing results that are averaged observations at least questionable.

To compare the results correctly, it is necessary to treat each result of a benchmark execution as an observation of a random variable. We can assume the results of several consecutive benchmark executions to be a set of independent identically distributed observations with a normal distribution. Under this assumption, the sets of results from multiple benchmark executions can be compared using the standard statistical tests for comparing samples from two or more normal populations. We use the two-sample F-test to validate the equal variance assumption, and the unpaired two-sample t-test to compare the averages of a pair of benchmark result sets.
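A possible realization of this comparison, sketched in Python with SciPy, is shown below. The function name and the significance level are our assumptions, and falling back to Welch's correction when the F-test rejects equal variances is our choice rather than something the paper specifies.

    import numpy as np
    from scipy import stats

    def compare_executions(old, new, alpha=0.01):
        """Compare two sets of per-execution benchmark results.

        An F-test checks the equal-variance assumption, then an unpaired
        two-sample t-test compares the means, as described in the text.
        The significance level alpha is an illustrative choice.
        """
        old, new = np.asarray(old, dtype=float), np.asarray(new, dtype=float)
        # Two-sided F-test of equal variances, built directly from the F
        # distribution (SciPy has no single-call two-sample variance F-test).
        f = np.var(old, ddof=1) / np.var(new, ddof=1)
        dfn, dfd = len(old) - 1, len(new) - 1
        p_var = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
        # Unpaired t-test; pooled variances if the F-test did not reject,
        # otherwise Welch's correction (our choice, not the paper's).
        _, p_mean = stats.ttest_ind(old, new, equal_var=(p_var >= alpha))
        return p_var, p_mean, p_mean < alpha  # last value: change suspected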

Figure 5: Results of the marshaling benchmark (benchmark B) for consecutive build versions.

The results of the technique applied to a real-world example are illustrated on figure 5. The example evaluates the development progress of the marshaling mechanism in TAO for a range of build versions dated from May 19, 2003 to October 27, 2003, with a step of one week. The technique was used to compare the sets of results from multiple benchmark executions for pairs of consecutive build versions. Bold lines mark the two performance changes that were detected in the development process. Note that it would be impossible to detect the performance changes by comparing the results of individual executions.
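As a sketch only, the pairwise comparison of consecutive build versions could be driven as follows, reusing the hypothetical compare_executions helper from section 3.4; the names and structure are our assumptions.

    def detect_changes(results_by_build, alpha=0.01):
        """Flag performance changes between consecutive build versions.

        results_by_build is an ordered list of (build_id, results) pairs,
        where results are the per-execution benchmark results; the helper
        compare_executions is the sketch from section 3.4.
        """
        changes = []
        pairs = zip(results_by_build, results_by_build[1:])
        for (old_id, old), (new_id, new) in pairs:
            _, p_mean, changed = compare_executions(old, new, alpha)
            if changed:
                changes.append((old_id, new_id, p_mean))
        return changes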

4. Conclusion

Regression benchmarking is a variant of regression testing focused on detecting performance regressions. We introduce regression benchmarking in the area of middleware benchmarking, explaining how middleware regression benchmarking differs from middleware benchmarking in general. Selecting the broad group of relatively simple benchmarks, we illustrate why the existing benchmarks do not give results sufficient for regression benchmarking. As the next step, we present a set of guidelines on minimizing result distortion and collecting enough observations, and propose a technique for detecting performance regressions using simple benchmarks that adhere to the presented guidelines. Importantly, the technique differs from the accepted practice of comparing results that are averaged observations, which is generally incorrect. We demonstrate this on a real-world example of middleware benchmarking.

Although the guidelines and techniques in the paper are sufficient to conduct regression benchmarking with simple middleware benchmarks, further work is necessary to achieve the same results for relatively complex benchmarks. Complex middleware benchmarks are indispensable because they exercise multiple functions of the middleware concurrently and therefore measure the effects of complex interactions among the functions. Unfortunately, the complex benchmarks are more expensive to run than the simple benchmarks, and their results have a less straightforward interpretation, especially when expressed as a single throughput value in operations per second, as in [8][10][12]. We are currently investigating the use of clustering [9] to separate the results of complex benchmarks into groups of values that lend themselves better to automated analysis.

Our work on regression benchmarking is available at http://nenya.ms.mff.cuni.cz. The current status includes a number of benchmarks and limited support for automated execution and data acquisition; support for automated results analysis is being added as the required techniques are developed.

5. Acknowledgements

The authors would like to remember Adam Buble for his invaluable contribution to our efforts, and to thank Franklin Webber for his insightful review of this paper. This work is partially sponsored by the Grant Agency of the Czech Republic, grant 102/03/0672.

6. References

[1] CAROL: Common Architecture for RMI ObjectWeb Layer, http://carol.objectweb.org.
[2] E. J. Chen, W. D. Kelton: Simulation-based Estimation of Quantiles, Winter Simulation Conference '99, USA, 1999.
[3] The Community OpenORB Project, http://openorb.sourceforge.net.
[4] Distributed Systems Research Group: Open CORBA Benchmarking Project, http://nenya.ms.mff.cuni.cz/~bench.
[5] Distributed Systems Research Group: Vendor CORBA Benchmarking Project, http://nenya.ms.mff.cuni.cz/projects.phtml?p=cbench.
[6] DOC Group: TAO Performance Scoreboard, http://www.dre.vanderbilt.edu/stats/performance.shtml.
[7] DOC Group: The ACE ORB, http://www.dre.vanderbilt.edu/tao.
[8] ECperf Specification, Version 1.1, Sun Microsystems, 2002, http://www.theserverside.com/ecperf.
[9] V. Faber: Clustering and the Continuous k-means Algorithm, Los Alamos Science, No. 22, 1994.
[10] ObjectWeb Consortium: RUBiS: Rice University Bidding System, http://rubis.objectweb.org.
[11] F. Plášil, P. Tůma, A. Buble: Charles University Response to the Benchmark RFI, OMG bench/98-10-04, 1998.
[12] Transaction Processing Performance Council: TPC Benchmark Web Commerce Specification, Version 1.8, 2002, http://www.tpc.org.
[13] P. Tůma, A. Buble: Open CORBA Benchmarking, SPECTS '01, USA, 2001.