Ineffectiveness of Use of Software Science Metrics as Predictors of Defects in Object Oriented Software

Zeeshan Ali Rana, Shafay Shamail, Mian Muhammad Awais
E-mail: {zeeshanr, sshamail, awais}@lums.edu.pk

(c) 2009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Abstract

Software science metrics (SSM) have been widely used as predictors of software defects. This usage stems from the observed correlation of size and complexity metrics with the number of defects. SSM were, however, proposed with the procedural paradigm and the structural nature of programs in view. Software development has since shifted from the procedural to the object oriented (OO) paradigm, and SSM have been used as defect predictors for OO software as well, yet their effectiveness for OO software remains to be established. This paper investigates the effectiveness of SSM for: a) classification of defect prone modules in OO software, and b) prediction of the number of defects. Various binary and numeric classification models have been applied to dataset kc1 with class-level data to study the role of SSM. The results show that removing SSM from the set of independent variables does not significantly affect either the classification of modules as defect prone or the prediction of the number of defects. In most cases the accuracy and the mean absolute error improved when SSM were removed from the set of independent variables. The results thus highlight the ineffectiveness of SSM in defect prediction for OO software.

1. Introduction

Software science metrics (SSM) [7], proposed by Halstead, are based on the number of operators and operands and their usage, and were formulated with the procedural paradigm in mind. These metrics are indicators of software size and complexity (for example, program length N and effort E measure size and complexity respectively). Earlier studies found a correlation of software size and complexity with the number of defects [17, 10] and used size and complexity metrics as predictors of defects. Studies have also used SSM for defect prediction and for classification of defect prone software modules [17, 8, 16, 2, 9, 6, 11, 12, 13, 20, 14, 15, 18]. Fenton and Neil [5] criticized the use of SSM and other size and complexity metrics in defect prediction models because 1) the relationship between complexity and defects is not entirely causal, and 2) defects are not a function of size; yet the majority of prediction models rest on these two assumptions [5]. Despite the critique, various studies have used SSM to study software developed in the procedural paradigm [11, 12, 20] as well as the object oriented paradigm [14, 3, 18].

With the shift from the procedural to the OO paradigm, metrics such as unique operands η2, total operand occurrences N2, program vocabulary η and program volume V no longer remain effective indicators of software complexity. This is because of the nature of the OO paradigm, where software consists of many classes and each class has its own operands (attributes). A system of 10-15 classes, each with 5-10 attributes, might not be as complex as these operator and operand based measures indicate. The complexity of OO software depends instead on the interaction between the objects of the classes and on the complexity of the methods of the classes. Using SSM as predictors of defects in OO software might therefore not be a wise decision.
This paper studies the role of SSM in defect prediction and attempts to establish that the use of software science metrics [7] does not contribute significantly to:
1. classifying OO software modules as defect prone or not defect prone (binary classification), or
2. predicting the number of defects in OO software (numeric classification).

The paper does so by running various classification models on dataset kc1 [1] with class-level data and analyzing the impact of removing SSM from the set of independent variables of the models. The experimental results show that removing SSM from the set of independent variables does not significantly affect the binary or numeric classification of OO software modules. Compared to the case when all the collected metrics are used, the number of incorrectly classified instances (binary classification) and the mean absolute error (numeric classification) improved in the absence of SSM. Section 2 discusses the methodology adopted to conduct this study. Section 3 presents the experimental results. Section 4 analyzes the results and discusses the ineffectiveness of SSM in defect prediction studies. Section 5 concludes the paper and outlines future work.

Table 1. List of classification models used from WEKA [19].

  BC Model          Abbr.   NC Model                 Abbr.
  Bayesian          Bay     Additive Regression      AR
  Decision Table    DTb     Decision Tree            DTr
  Instance Based    IB      Linear Regression        LR
  Logistic          Log     Support Vector Reg.      SVR

2. Methodology

The paper studies the role of SSM in defect prediction of OO software using dataset kc1 [1], which consists of class-level data from a NASA project. The dataset has 145 instances, and each instance has 94 attributes, which are metrics collected for that software instance. These attributes include object oriented metrics [4], metrics derived from cyclomatic complexity such as sumCYCLOMATIC_COMPLEXITY, and metrics derived from SSM such as minNUM_OPERANDS and avgNUM_OPERANDS. A few other size metrics, such as LOC, are also among the 94 attributes. In total, 48 metrics were derived from SSM. We applied the models listed in Table 1 first using all 94 attributes as input, and then applied the same models using only the 46 metrics that are not derived from SSM.
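As an illustration, the attribute split described above can be sketched as follows. This is not the authors' code: the attribute names and the keyword filter are hypothetical stand-ins, since the full list of kc1 headers is not reproduced here.

```python
# Sketch of the attribute split: of the 94 collected metrics, the 48
# derived from Halstead's software science metrics (SSM) are dropped,
# leaving the 46 non-SSM metrics used in the "without SSM" runs.
# The names and keywords below are illustrative, not the actual headers.

SSM_KEYWORDS = ("NUM_OPERANDS", "NUM_OPERATORS", "HALSTEAD")

def split_attributes(all_attributes):
    """Partition attribute names into SSM-derived and non-SSM lists."""
    ssm = [a for a in all_attributes
           if any(k in a.upper() for k in SSM_KEYWORDS)]
    non_ssm = [a for a in all_attributes if a not in ssm]
    return ssm, non_ssm

attributes = ["minNUM_OPERANDS", "avgNUM_OPERANDS", "sumHALSTEAD_EFFORT",
              "sumCYCLOMATIC_COMPLEXITY", "COUPLING_BETWEEN_OBJECTS", "LOC"]
ssm, non_ssm = split_attributes(attributes)
print(ssm)      # SSM-derived metrics, dropped in the second set of runs
print(non_ssm)  # inputs for the "without SSM" runs
```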
The data is available in two structurally different formats: one allows binary classification and the other allows numeric classification. We performed binary classification (BC) of modules, i.e. defect prone or not defect prone, as well as numeric classification (NC), i.e. prediction of the number of defects in a module, using the classification models available in WEKA [19] and listed in Table 1. The classification is done using:
1. all the metrics present in the dataset, and
2. all the metrics except the SSM based metrics.

Because of the structural differences in the data, we applied different models for BC and NC and recorded different performance measures. Likewise, the impact of removing SSM from the set of inputs is studied using different effectiveness measures for the two kinds of classification. We first discuss the measures related to BC and then those related to NC.

Accuracy is used as the model performance measure for BC. Accuracy (Acc) is based on the number of correctly classified instances (CCI) and the number of incorrectly classified instances (ICI), and is defined as:

  Acc = CCI / (CCI + ICI)    (1)

Effectiveness Eff_i is defined to study the impact of removing SSM from the set of inputs to the i-th binary classification model:

  Eff_i = Acc_{i,all} - Acc_{i,notSSM}    (2)

where Acc_{i,all} is the accuracy of model i using all metrics and Acc_{i,notSSM} is its accuracy using all metrics except SSM. Use of SSM is considered effective by model i only if Eff_i is above a threshold α = 0.01, i.e. only if removing SSM from the inputs of model i decreases its accuracy by more than one percentage point. A negative Eff_i means that the accuracy of model i improved when SSM were removed from the inputs. To measure the overall effectiveness of SSM, Eff_avg, the average of all the Eff_i values, is used.
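The BC measures above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the numbers plugged in are the Bayesian classifier's counts from Table 2.

```python
# Accuracy (Eq. 1) and per-model effectiveness Eff_i (Eq. 2) for the
# binary-classification models.

ALPHA = 0.01  # threshold above which SSM count as effective for a model

def accuracy(cci, ici):
    """Eq. 1: fraction of correctly classified instances."""
    return cci / (cci + ici)

def effectiveness(acc_all, acc_not_ssm):
    """Eq. 2: accuracy drop caused by removing SSM from the inputs."""
    return acc_all - acc_not_ssm

# Bayesian classifier: 109/36 correct/incorrect with all metrics,
# 108/37 without the SSM-derived metrics (Table 2).
eff_bay = effectiveness(accuracy(109, 36), accuracy(108, 37))
print(round(eff_bay, 3))  # 0.007
print(eff_bay > ALPHA)    # False -> SSM not effective for this model
```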
Use of SSM will be considered effective only if Eff_avg is a positive number greater than λ = 0.005; conversely, SSM cannot be considered ineffective unless Eff_avg falls below λ.

The performance measures recorded for the NC models are Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Square Error (RRSE), defined by equations 3, 4, 5 and 6 respectively:

  MAE  = (1/n) Σ |P_i - A_i|    (3)
  RMSE = sqrt( (1/n) Σ (P_i - A_i)^2 )    (4)
  RAE  = Σ |P_i - A_i| / Σ |A_i - µ|    (5)
  RRSE = sqrt( Σ (P_i - A_i)^2 / Σ (A_i - µ)^2 )    (6)

where n is the total number of instances, P_i is the predicted number of errors in the i-th instance, A_i is the observed number of errors in the i-th instance, and µ is the mean of the observed values.

To study the impact of removing SSM from the set of inputs to numeric classification model i, we define a measure Err_i based on the MAE of model i:

  Err_i = MAE_{i,notSSM} - MAE_{i,all}    (7)

where MAE_{i,notSSM} is the MAE of model i using all metrics except SSM and MAE_{i,all} is the MAE of model i using all metrics. Err_i must be greater than δ = 0.1 for SSM to be considered an effective predictor of the number of defects by model i. To check the overall effectiveness of SSM for numeric classification, the average error Err_avg is defined: SSM are considered effective if the average of all Err_i values is a positive quantity greater than ε = 0.05, and cannot be considered ineffective unless Err_avg falls below ε.

3. Results

Table 2 shows the results of binary classification of software modules. Using SSM along with the other available metrics to classify defect prone modules does not help for any of the models except the Bayesian classifier; rather, dropping SSM as predictors improves CCI and model accuracy on the dataset under study. Equivalently, ICI decreased for all these models when SSM were dropped from the input set, a better performance than when classification was done using all metrics including SSM. When SSM were removed from the input of the Bayesian classifier, which has the highest accuracy among the four models, the number of ICI increased by 1 and the accuracy of the model decreased by 0.7 percentage points. Instance based learning with 1 nearest neighbor (IB) showed the highest gain in accuracy, about 4 percentage points, when SSM were not part of the input to the classifier.
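The NC error measures of equations 3-7 can be written as plain functions; the following is an illustrative sketch (WEKA computes these internally), with P the predicted and A the observed defect counts.

```python
import math

# NC error measures (Eqs. 3-6) and Err_i (Eq. 7).

def mae(P, A):
    """Eq. 3: mean absolute error."""
    return sum(abs(p - a) for p, a in zip(P, A)) / len(A)

def rmse(P, A):
    """Eq. 4: root mean square error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(P, A)) / len(A))

def rae(P, A):
    """Eq. 5: absolute error relative to always predicting the mean."""
    mu = sum(A) / len(A)
    return (sum(abs(p - a) for p, a in zip(P, A)) /
            sum(abs(a - mu) for a in A))

def rrse(P, A):
    """Eq. 6: squared error relative to always predicting the mean."""
    mu = sum(A) / len(A)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(P, A)) /
                     sum((a - mu) ** 2 for a in A))

def err_i(mae_not_ssm, mae_all):
    """Eq. 7: positive when dropping SSM hurts, negative when it helps."""
    return mae_not_ssm - mae_all

# Linear regression, MAE values from Table 3:
print(round(err_i(5.60, 6.59), 2))  # -0.99
```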
Table 2. Results of binary classification with and without SSM.

  Model   Input          CCI   ICI   Acc
  Bay     All            109   36    0.751
          Without SSM    108   37    0.744
  Log     All             99   46    0.683
          Without SSM    104   41    0.717
  DTb     All            102   43    0.703
          Without SSM    104   41    0.717
  IB      All            102   43    0.703
          Without SSM    108   37    0.744

The results of numeric classification of modules are presented in Table 3: the MAE of every model except support vector regression decreased when SSM were absent from the classifier inputs. SVR had the lowest MAE among the NC models in the presence of SSM, and the increase in its MAE is an interesting observation. Linear regression (LR) showed a substantial decrease of 0.99 in MAE in the absence of SSM from the input metrics. The other three performance measures for NC showed the same pattern as MAE, i.e. for all models except SVR, the values of RMSE, RAE and RRSE decreased in the absence of SSM.

Table 3. Results of numeric classification with and without SSM.

  Model   Input          MAE    RMSE    RAE       RRSE
  AR      All            4.58   10.32    75.89%    94.59%
          Without SSM    4.37    9.78    72.44%    89.61%
  DTr     All            4.92   11.23    81.51%   102.84%
          Without SSM    4.89   10.77    81.05%    98.71%
  LR      All            6.59   11.17   109.08%   102.35%
          Without SSM    5.60    9.42    92.77%    86.25%
  SVR     All            4.39    7.42    72.64%    67.96%
          Without SSM    4.66    8.96    77.16%    82.13%

4. Analysis and Discussion

As noted above, the accuracies of the majority of the BC models improved in the absence of SSM, and all the performance measures improved for the majority of the NC models as well. This section discusses the extent of the improvement in the performance measures of the BC and NC models. First the effectiveness of SSM reported by each model is discussed on the basis of Eff_i and Err_i; then the overall effectiveness of SSM, combining the BC and NC results, is discussed on the basis of Eff_avg and Err_avg.

Table 4. Effectiveness of SSM reported by all models. The Eff_i values follow from the accuracies in Table 2 via equation 2.

  BC Model   Eff_i     NC Model   Err_i
  Bay         0.007    AR         -0.21
  DTb        -0.014    DTr        -0.03
  IB         -0.041    LR         -0.99
  Log        -0.034    SVR         0.27
  Eff_avg    -0.021    Err_avg    -0.24

The effectiveness of SSM reported by each model, and the average values of the effectiveness measures, are shown in Table 4. The first two columns of the table show that no model reported a significant decrease in accuracy on dropping SSM, i.e. no Eff_i is greater than α. Unlike the other three BC models, the Eff_i of the Bayesian classifier is positive, but since it does not exceed α we cannot take it as an indication of the effectiveness of SSM. Eff_avg is also less than λ, hence we cannot say that SSM have been effective in classifying software modules as defect prone or not defect prone for the dataset under study. In fact, Eff_avg is negative, which suggests that SSM have not only been ineffective for this dataset but negatively affect the classification. Moreover, the decrease in ICI and increase in CCI on dropping SSM further indicates that SSM have a negative effect on the classification of modules in kc1.

Among the NC models, the Err_i reported by SVR is greater than δ, which means that SVR reported SSM as effective for this dataset. SVR differs from the rest of the NC models used in this study: while all the models minimize the empirical classification error, SVR at the same time maximizes the geometric margin between the classes. Dropping SSM reduced the empirical error for all the models, but SSM appear to have helped SVR in maximizing the margin between the classes. The values reported by the rest of the NC models are less than δ. Err_avg is a negative value below ε, indicating that using SSM to predict the number of defects in this OO software data is not a wise decision.

The dataset used to study the behavior of the classification models in the absence of SSM comprises 145 instances.
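The overall verdict can be sketched as follows. This is an illustrative check, not the authors' code; the BC Eff_i values are recomputed from Table 2 via equation 2, and the NC Err_i values are those of Table 4.

```python
# Overall-effectiveness check from Section 4: SSM count as effective
# only if the averaged per-model measure exceeds its threshold
# (lambda for BC, epsilon for NC).

LAM, EPS = 0.005, 0.05

def avg_effectiveness(values):
    """Average of the per-model Eff_i or Err_i values."""
    return sum(values) / len(values)

eff_avg = avg_effectiveness([0.007, -0.014, -0.041, -0.034])  # Bay, DTb, IB, Log
err_avg = avg_effectiveness([-0.21, -0.03, -0.99, 0.27])      # AR, DTr, LR, SVR

print(round(err_avg, 2))  # -0.24
print(eff_avg > LAM)      # False -> SSM not effective for binary classification
print(err_avg > EPS)      # False -> SSM not effective for numeric classification
```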
Although the number of instances is sufficient for an initial investigation, the results presented here cannot be generalized; more software instances are needed to establish that SSM are ineffective defect predictors for OO software.

5. Conclusions and Future Work

This paper studies the role of software science metrics (SSM) in defect prediction of object oriented (OO) software. Binary and numeric classification models available in WEKA were applied to dataset kc1 with class-level data, first using all the metrics available in the dataset and then with SSM removed from the input, and the accuracies and error values of all the models were observed. The effectiveness of SSM is measured at the model level by comparing the accuracies and mean absolute errors of models with and without SSM, and overall by averaging the reported error values of all models. Of the four models used for binary classification, none reported SSM as effective measures for classifying OO software modules as defect prone. Among the NC models, support vector regression reported SSM as effective in predicting the number of defects, whereas the other three models reported a negative role of SSM. The averages of the reported errors of all the models show that using SSM for classification of OO software modules and for predicting the number of defects does not help in this case, and the errors can even improve if SSM are dropped from the input. To verify this finding, more software instances need to be analyzed, and further study with more datasets is required to establish that SSM are ineffective defect predictors for OO software.

6. Acknowledgments

We would like to thank the Higher Education Commission (HEC) of Pakistan and the Lahore University of Management Sciences (LUMS) for funding this research.

References

[1] G. Boetticher, T. Menzies, and T. Ostrand. PROMISE repository of empirical software engineering data, 2007.
[2] L. C. Briand, V. R. Basili, and C. J. Hetmanski. Developing interpretable models with optimized set reduction for identifying high-risk software components. IEEE Transactions on Software Engineering, 19(11):1028-1044, November 1993.
[3] V. U. B. Challagulla, F. B. Bastani, and R. A. Paul. Empirical assessment of machine learning based software defect prediction techniques. In Proceedings of the 10th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS '05). IEEE Computer Society, 2005.
[4] S. R. Chidamber and C. F. Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476-493, June 1994.
[5] N. E. Fenton and M. Neil. A critique of software defect prediction models. IEEE Transactions on Software Engineering, 25(5):675-687, September/October 1999.
[6] S. S. Gokhale and M. R. Lyu. Regression tree modeling for the prediction of software quality. In Proceedings of the 3rd ISSAT International Conference on Reliability, 1997.
[7] M. H. Halstead. Elements of Software Science. 1977.
[8] H. A. Jensen and K. Vairavan. An experimental study of software metrics for real-time software. IEEE Transactions on Software Engineering, SE-11(2):231-234, February 1985.
[9] T. M. Khoshgoftaar, D. L. Lanning, and A. S. Pandya. A comparative study of pattern recognition techniques for quality evaluation of telecommunications software. IEEE Journal on Selected Areas in Communications, 12(2):279-291, February 1994.
[10] T. M. Khoshgoftaar and J. C. Munson. Predicting software development errors using software complexity metrics. IEEE Journal on Selected Areas in Communications, 8(2), February 1990.
[11] T. M. Khoshgoftaar and E. B. Allen. A comparative study of ordering and classification of fault-prone software modules. Empirical Software Engineering, 4:159-186, 1999.
[12] T. M. Khoshgoftaar and N. Seliya. Fault prediction modeling for software quality estimation: Comparing commonly used techniques. Empirical Software Engineering, 8(3):255-283, September 2003.
[13] T. M. Khoshgoftaar and N. Seliya. Comparative assessment of software quality classification techniques: An empirical case study. Empirical Software Engineering, 9:229-257, 2004.
[14] A. G. Koru and H. Liu. An investigation of the effect of module size on defect prediction using static measures. In Proceedings of the International Workshop on Predictor Models in Software Engineering (PROMISE '05). ACM Press, 2005.
[15] P. L. Li, J. Herbsleb, M. Shaw, and B. Robinson. Experiences and results from initiating field defect prediction and product test prioritization efforts at ABB Inc. In Proceedings of the 28th International Conference on Software Engineering (ICSE '06), 2006.
[16] J. C. Munson and T. M. Khoshgoftaar. The detection of fault-prone programs. IEEE Transactions on Software Engineering, 18(5):423-434, May 1992.
[17] L. M. Ottenstein. Quantitative estimates of debugging requirements. IEEE Transactions on Software Engineering, SE-5(5):504-514, September 1979.
[18] N. Seliya and T. M. Khoshgoftaar. Software quality estimation with limited fault data: A semi-supervised learning perspective. Software Quality Journal, 15:327-344, August 2007.
[19] I. H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. J. Cunningham. The Waikato Environment for Knowledge Analysis (WEKA), 2008.
[20] F. Xing, P. Guo, and M. R. Lyu. A novel method for early software quality prediction based on support vector machine. In Proceedings of the 16th IEEE International Symposium on Software Reliability Engineering. IEEE, 2005.