MSc Software Testing (MSc Prófun hugbúnaðar), Lectures 43 & 44: Evaluating Test Driven Development. 15/11/2007, Dr Andy Brooks
Case Study
Reference: Evaluating Advantages of Test Driven Development: a Controlled Experiment with Professionals, Gerardo Canfora et al., Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering (ISESE '06), pp. 364–371, 2006, ACM.
1. INTRODUCTION
Test Driven Development (TDD)
First, the developer defines the classes and their interfaces.
Then the developer writes a test suite for each class, including the assertions required to verify method behaviour.
Then the developer writes the method bodies and executes the tests.
If a test fails, the developer changes the code to remove the bug.
The process ends when all the tests pass.
A quick and dirty guide to JUnit

// Math.java (note: this Math class shadows java.lang.Math within its package)
public class Math {
    static public int add(int a, int b) {
        return a + b;
    }
}

// TestMath.java (JUnit 3 style)
import junit.framework.*;

public class TestMath extends TestCase {
    public void testAdd() {
        int num1 = 3;
        int num2 = 2;
        int total = 5;
        int sum = Math.add(num1, num2);
        assertEquals(total, sum); // expected value first, actual value second
    }
}

from http://www.jaredrichardson.net/blog/
1. INTRODUCTION
TDD advantages:
Test documentation is within the code; developers do not need to search for it.
The tests provide an unambiguous quality indicator for the code: a test either passes or fails.
The authors believe that: (i) TDD is more time consuming than TAC (testing after coding); but (ii) TDD improves the quality of unit testing.
1. INTRODUCTION
Two research questions:
Is TDD more or less productive than TAC?
Does TDD improve the quality of unit testing?
Quality is considered in terms of accuracy and precision (cf. the target diagrams: high accuracy but low precision vs. high precision but low accuracy).
2. RELATED WORK
A structured experiment of test-driven development. Boby George and Laurie Williams. Information and Software Technology, Volume 46, Issue 5, 15 April 2004, pp. 337–342, Elsevier B.V.
Abstract: Test Driven Development (TDD) is a software development practice in which unit test cases are incrementally written prior to code implementation. We ran a set of structured experiments with 24 professional pair programmers. One group developed a small Java program using TDD while the other (control group) used a waterfall-like approach. Experimental results, subject to external validity concerns, tend to indicate that TDD programmers produce higher quality code because they passed 18% more functional black-box test cases. However, the TDD programmers took 16% more time. Statistical analysis of the results showed that a moderate statistical correlation existed between time spent and the resulting quality. Lastly, the programmers in the control group often did not write the required automated test cases after completing their code. Hence it could be perceived that waterfall-like approaches do not encourage adequate testing. This intuitive observation supports the perception that TDD has the potential for increasing the level of unit testing in the software industry.
2. RELATED WORK
Test-driven development as a defect-reduction practice. Williams, L., Maximilien, E.M., and Vouk, M. 14th International Symposium on Software Reliability Engineering (ISSRE '03), pp. 34–45, IEEE.
Abstract: Test-driven development is a software development practice that has been used sporadically for decades. With this practice, test cases (preferably automated) are incrementally written before production code is implemented. Test-driven development has recently re-emerged as a critical enabling practice of the extreme programming software development methodology. We ran a case study of this practice at IBM. In the process, a thorough suite of automated test cases was produced after UML design. In this case study, we found that the code developed using a test-driven development practice showed, during functional verification and regression tests, approximately 40% fewer defects than a baseline prior product developed in a more traditional fashion. The productivity of the team was not impacted by the additional focus on producing automated test cases. This test suite aids in future enhancements and maintenance of this code. The case study and the results are discussed in detail.
2. RELATED WORK
Experiment about test-first programming. Müller, M.M. and Hagner, O. IEE Proceedings - Software, 2002, Vol. 149(5), pp. 131–136, IEEE.
Abstract: Test-first programming is one of the central techniques of extreme programming. Programming test-first means (i) write down a test-case before coding and (ii) make all the tests executable for regression testing. Thus far, knowledge about test-first programming is limited to experience reports. Nothing is known about the benefits of test-first compared to traditional programming (design, implementation, test). This paper reports an experiment comparing test-first to traditional programming. It turns out that test-first does not accelerate the implementation, and the resulting programs are not more reliable, but test-first seems to support better program understanding.
2. RELATED WORK
A prototype empirical evaluation of test driven development. Geras, A., Smith, M. and Miller, J. 10th International Symposium on Software Metrics (METRICS '04), 2004, pp. 405–416, IEEE.
Abstract: Test driven development (TDD) is a relatively new software development process. On the strength of anecdotal evidence and a number of empirical evaluations, TDD is starting to gain momentum as the primary means of developing software in organizations worldwide. In traditional development, tests are for verification and validation purposes and are built after the target product feature exists. In test-driven development, tests are used for specification purposes in addition to verification and validation. An experiment was devised to investigate the distinction between test-driven development and traditional, test-last development from the perspective of developer productivity and software quality. The results of the experiment indicate that while there is little or no difference in developer productivity in the two processes, there are differences in the frequency of unplanned test failures. This may lead to less debugging and more time spent on forward progress within a development project. As with many new software development technologies however, this requires further study, in particular to determine if the positive results translate into lower total costs of ownership.
2. RELATED WORK
Towards empirical evaluation of test-driven development in a university environment. Pancur, M., Ciglaric, M., Trampus, M. and Vidmar, T. EUROCON 2003: Computer as a Tool, Vol. 2, pp. 83–86, IEEE.
Abstract: Test driven development (TDD) is an agile software development technique and it is one of the core development practices of Extreme programming (XP). In TDD, developers write automatically executable tests prior to writing the code they test. We ran a set of experiments to empirically assess different parameters of TDD. We compared TDD to a more "traditionally" oriented iterative test-last development process (ITL). Our preliminary results show that TDD is not substantially different from ITL, and our qualitative findings about the development process differ from the results obtained by other researchers.
3. THE EXPERIMENT
The two experimental hypotheses:
H01: there is no difference in productivity between TDD and TAC.
H02: there is no difference in the quality of unit tests between TDD and TAC.
3. THE EXPERIMENT
Subjects: 28 company employees of the Soluziona Software Factory, all with at least one year at the company.
All with a BSc in Computer Science.
All with 5 years of Java experience.
All with experience of several software engineering projects.
All with a wide knowledge of programming and databases.
But no previous experience of TDD.
Three hours of TDD training were given before the experiment.
3. THE EXPERIMENT
Experimental platform: Java, Eclipse IDE, JUnit.
Andy notes: no version numbers?
3. THE EXPERIMENT
Experimental task: the subjects were required to write a program to act as a TextAnalyzer for a supplied piece of text.
The first requirement was to calculate the frequency of the words in the text and the positions of their first occurrences.
The second requirement was to calculate the maximum and minimum distance between two words indicated by the user.
See the article's appendix for detailed descriptions.
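The first requirement can be sketched in a few lines of Java. The class and method names below are my own illustration, not the subjects' code, and I assume "position" means word index rather than character offset:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the first requirement (not the paper's code):
// word frequencies plus the position of each word's first occurrence.
public class TextAnalyzer {
    // Maps each word to {frequency, word index of first occurrence}.
    public static Map<String, int[]> analyze(String text) {
        Map<String, int[]> result = new LinkedHashMap<>();
        int pos = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue; // skip empty token from leading separators
            // First sighting stores {1, pos}; later sightings bump the count
            // and keep the original first-occurrence position.
            result.merge(w, new int[]{1, pos},
                    (old, x) -> new int[]{old[0] + 1, old[1]});
            pos++;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> stats = analyze("the cat saw the dog");
        System.out.println(stats.get("the")[0]); // frequency of "the"
        System.out.println(stats.get("the")[1]); // index of its first occurrence
    }
}
```

The second requirement (maximum and minimum distance between two user-chosen words) would follow the same pattern, but would need to record every occurrence position of each word rather than only the first.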
3. THE EXPERIMENT
Experimental forms, examples: subjects completed two forms, one for each experimental run. The End Time was recorded by the subjects when the tests succeeded.
3. THE EXPERIMENT
Variables:
MeanTPA - mean time per assertion
MeanTime - mean time taken by subjects for testing
TotalTime - total time taken by a subject
MeanAPM - mean assertions per method
AssertTot - total number of assertions in a project
H02: so quality is being assessed by simply counting assertions...
3. THE EXPERIMENT
Table 1. The experimental design: within subjects and counterbalanced.
Two runs, each lasting five hours. Each subject implemented both requirements. Each subject used TDD then TAC, or TAC then TDD. The training session on TDD included a seminar and lab exercises.
4. ANALYSIS OF DATA
4.1 Descriptive Statistics (requirements 1 and 2 considered together; Figure 1)
Employees = 28; experimental runs = 2.
Figure 1 shows Time Per Assertion and Assertions Per Method.
TDD takes more time. TDD seems to result in only a few more assertions: 1.75 more assertions per project.
Andy says: MeanTPA seems ill-defined here.
4. ANALYSIS OF DATA
4.1 Descriptive Statistics (requirements 1 and 2 considered together; Figure 1)
TDD is said by the authors to foster greater testing precision by increasing the number of assertions that are written: 1.75 more assertions per project.
TDD is said by the authors to foster greater testing accuracy because more time was taken completely identifying equivalence classes.
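To make the equivalence-class idea behind the accuracy claim concrete, here is a minimal sketch (my own illustration, not from the paper) in which each check exercises a different input class of a small word-count routine:

```java
// Illustrative only: one check per equivalence class of inputs.
public class EquivalenceClasses {
    public static int wordCount(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Class 1: empty input.
        if (wordCount("") != 0) throw new AssertionError("empty");
        // Class 2: a single word, no separators.
        if (wordCount("hello") != 1) throw new AssertionError("single");
        // Class 3: several words with mixed, repeated whitespace.
        if (wordCount("a  b\tc") != 3) throw new AssertionError("multiple");
        System.out.println("all equivalence classes covered");
    }
}
```

The extra time TDD subjects spent enumerating such input classes up front is, on the authors' account, where the accuracy gain would come from.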
4. ANALYSIS OF DATA
4.1 Descriptive Statistics (requirements 1 and 2 considered together; Figure 2)
TDD is said to be more predictable, but the standard deviations for TDD variables are larger than for TAC variables! (See also Table 7.)
Andy asks: have outlying data points not been removed?
The TDD box for AssertTot is much bigger than the TAC box. The TDD box for TimeTot is a little smaller than the TAC box, but TDD has 4 outlying values as opposed to one for TAC.
4.2 Hypothesis testing (see Table 3)
Mann-Whitney tests were used because the data were not normally distributed.
The null hypothesis H01 was rejected: TDD takes more time than TAC, on average 50 minutes longer per project.
The null hypothesis H02 was not rejected: the differences in the number of assertions could have arisen by chance. The 1.75 more assertions per project was not statistically significant.
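For reference, the Mann-Whitney U statistic behind these tests can be computed in plain Java. This is a self-contained sketch (not the authors' analysis code) that returns the smaller of U1 and U2, using average ranks for ties:

```java
import java.util.Arrays;

// Sketch of the Mann-Whitney U statistic used in the paper's analysis.
public class MannWhitney {
    // Returns min(U1, U2) for samples a and b.
    public static double u(double[] a, double[] b) {
        int n1 = a.length, n2 = b.length;
        double[] all = new double[n1 + n2];
        System.arraycopy(a, 0, all, 0, n1);
        System.arraycopy(b, 0, all, n1, n2);
        Arrays.sort(all);
        // R1 = sum of the (tie-averaged) ranks of sample a in the pooled data.
        double r1 = 0;
        for (double v : a) r1 += rank(all, v);
        double u1 = r1 - n1 * (n1 + 1) / 2.0; // U1 from the rank-sum formula
        double u2 = (double) n1 * n2 - u1;    // U1 + U2 = n1 * n2
        return Math.min(u1, u2);
    }

    // Average 1-based rank of value v in the sorted pooled sample.
    private static double rank(double[] sorted, double v) {
        int first = -1, last = -1;
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == v) { if (first < 0) first = i; last = i; }
        }
        return (first + last) / 2.0 + 1;
    }

    public static void main(String[] args) {
        // Completely separated samples give the minimum possible U of 0.
        System.out.println(u(new double[]{1, 2, 3}, new double[]{4, 5, 6}));
    }
}
```

The reported p-values would then come from comparing U against its null distribution (or a normal approximation for larger samples).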
4.3 Lessons learned to improve the experimental design
The authors suggest it would be useful to examine code quality, as they believed code quality was improved using TDD. The authors also suggest using larger time windows, as applying TDD properly can be time consuming.
5. Internal validity issues
A within-subjects design helps reduce differences caused by subject variability. Subject variability was also controlled by using subjects who all had a similar professional background and who all received training in TDD and JUnit at the same seminar. Requirements 1 and 2 were designed to be as independent as possible to reduce learning effects between the two experimental runs. Mann-Whitney tests found no evidence of learning effects between the two experimental runs. See Table 4.
5. Internal validity issues
Fatigue effects were controlled by holding training and the two experimental runs on three separate but consecutive days. Fatigue effects were not detected (some subjects even asked for more time). Subjects were motivated to take part since learning about TDD and JUnit could benefit them in their daily work. Both experimental runs were supervised to prevent subjects working together or otherwise sharing solutions. Mann-Whitney tests found no statistically significant differences between the data for requirement 1 and the data for requirement 2. See Table 5.
5. External validity issues
The subjects were all professionals. The use of Java, Eclipse, and JUnit is representative of industrial working environments in software development. The two requirements, however, are not comparable to real industrial projects.
6. Conclusions
TDD requires more time. No statistically significant evidence was found to suggest that TDD improves the accuracy and precision of unit testing. The authors write: "We are convinced that TDD increases such quality aspects and that evidence might be obtained in a longer experiment..."
6. Conclusions
TDD is more predictable than TAC. Andy says: their data does not support this!
The authors are planning to: replicate the experiment; enlarge the period of observation to 6 or even 12 months; analyze code quality with regard to software maintenance.
Critical commentary by Andy
Why did the authors not debrief subjects to find out possible reasons for the outlying data points? A subject using TAC wrote the most assertions. A TDD subject took almost 3x the average time for one project.
The claim that TDD is more predictable is simply wrong.
A plot of individuals' TotalTime against AssertTot might have exposed a time-accuracy trade-off.
Why did the authors not compare the quality of assertion writing between TDD and TAC?