Using Source Code and Process Metrics for Defect Prediction - A Case Study of Three Algorithms and Dimensionality Reduction

Wenjing Han, Chung-Horng Lung, Samuel Ajila
2016 Journal of Software  
Software defect prediction helps a software development team allocate test resources efficiently and better understand the root causes of defects. It can also help explain why a project is failure-prone. This paper applies binary classification to predict whether a software component has a bug, using three widely used machine learning algorithms: Random Forest (RF), Neural Networks (NN), and Support Vector Machine (SVM). The paper investigates the applications of these algorithms to the challenging issue of predicting defects in software components. To that end, it combines source code metrics and process metrics as indicators for the Eclipse environment, applying the three algorithms to a sample of weekly Eclipse features. In addition, this paper deals with the complex issue of data dimension, and our results confirm the predictive capabilities of data dimension reduction techniques such as Variable Importance (VI) and PCA. In our case, the results of using only two features (NBD_max and Pre-defects) are comparable to the results of using 61 features. Furthermore, we evaluate the performance of the three algorithms vis-à-vis the data; both Neural Network and Random Forest turned out to have the best fit.

... reduction on the predictive results? Furthermore, the work in this paper used a large and up-to-date data set from the open source Eclipse project (www.eclipse.org). The Eclipse project is popular in the open source community for software development, and the data for the Eclipse bug information are readily available. Our approach involves pre-processing and analyzing data sets that contain 102,675 records, then adopting different dimension reduction techniques, such as evaluating variable importance and applying Principal Component Analysis (PCA). Evaluating variable importance reduces the data size, saves computational cost, and identifies key features for defect prediction. We also apply PCA for dimension reduction and compare the quality of its results against approaches that do not use dimension reduction. To achieve high classification accuracy, we train three algorithms on the data: Random Forest (RF), Neural Networks (NN), and Support Vector Machine (SVM).
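As a rough illustration of the PCA-based dimension reduction described above, the sketch below extracts the first principal component of a small metric table using only the Python standard library. It is not the authors' implementation: the data values and the three unnamed metric columns are hypothetical, the features are centered but not standardized, and power iteration stands in for whatever PCA routine the study actually used.

```python
import math

def covariance_matrix(rows):
    """Return per-column means and the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return means, cov

def first_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: variance captured along the unit vector v.
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue

# Hypothetical data: five components, three metrics (two strongly correlated).
metrics = [
    (1.0, 2.0, 0.5),
    (2.0, 4.1, 0.4),
    (3.0, 5.9, 0.6),
    (4.0, 8.2, 0.5),
    (5.0, 9.9, 0.5),
]

means, cov = covariance_matrix(metrics)
direction, eigenvalue = first_component(cov)

# Share of total variance explained by the first component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# One-dimensional representation of each component.
scores = [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
          for r in metrics]
```

In this toy table a single component carries nearly all of the variance, which is the kind of situation in which a reduced feature set can perform comparably to the full one.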
We then use a confusion matrix and resampling to compare algorithm performance, and a Receiver Operating Characteristic (ROC) curve to describe the results of the probabilistic classifiers. The main contributions of this research are as follows. First, we identify the key features among code metrics and process metrics for defect prediction. Second, we evaluate the performance of three algorithms, which can serve as a reference for choosing an effective algorithm for a specific project. Third, we utilize different dimension reduction techniques (evaluating variable importance and PCA) and compare the quality of their results.

The rest of the paper is organized as follows: Section 2 presents the background information and related work. Section 3 discusses the methodology and approach. Section 4 presents the results of data training using the three algorithms. Section 5 analyses and discusses the experimental results. Lastly, Section 6 concludes the paper and outlines directions for future work.

Background and Related Work

Neural Networks, Support Vector Machine, and Random Forest

This section briefly describes the concepts and definitions of the three machine learning algorithms used in this research work. Related work is also presented.

Neural networks

A neural network (NN) is a two-stage regression or classification model, typically represented by a network diagram [5]. Several variants of neural network classifier exist, including feed-forward, back-propagation, time-delay, and error-correction neural network classifiers. NNs are popular due to their flexibility and adaptive capabilities, which make them well suited to non-linear optimization. Russell (1993) [6] detailed the four main components of a neural network algorithm:

• Processing units denoted by u_j, with each u_j possessing a certain activation level a_j(t) at any point in time t.
• Weighted interconnections between different processing units, which determine how the activation of one unit leads to an input for a corresponding unit.
• An activation rule, which operates on the set of input signals at a specific unit to produce a new output signal and activation.
• A learning rule, specifying how to adjust the weights for a given input/output pair. This component is optional.

The starting values for the weights in an NN are chosen to be random values near zero, so that the model starts out nearly linear and becomes nonlinear as the weights increase. Using exactly zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Conversely, starting with large weights often leads to poor solutions. Neural networks often have too many weights and will over-fit the data at the global minimum. Therefore, ...

Journal of Software, Volume 11, Number 9, September 2016

[Appendix B5: Three Models (NN, SVM, and RF) Resample ROC Result]
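The four components listed in the neural network description, and the remark about starting weights, can be sketched as a tiny network in plain Python. This is a minimal illustration under stated assumptions, not the configuration used in the study: the hidden units use tanh with no bias terms, the output is a single sigmoid unit trained with cross-entropy loss, and the input vector, label, and layer sizes are hypothetical.

```python
import math
import random

def forward(x, w_hidden, w_out):
    """Activation rule: weighted inputs pass through tanh, then a sigmoid output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, 1.0 / (1.0 + math.exp(-logit))

def gradients(x, y, w_hidden, w_out):
    """Derivatives of the cross-entropy loss with respect to every weight.

    A learning rule would subtract a small multiple of these from the weights.
    """
    hidden, p = forward(x, w_hidden, w_out)
    err = p - y  # derivative of the loss with respect to the output logit
    grad_out = [err * h for h in hidden]
    grad_hidden = [[err * w_o * (1.0 - h * h) * xi for xi in x]
                   for w_o, h in zip(w_out, hidden)]
    return grad_hidden, grad_out

def init_weights(n_inputs, n_hidden, scale, rng):
    """Weighted interconnections, drawn uniformly from [-scale, scale]."""
    w_hidden = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-scale, scale) for _ in range(n_hidden)]
    return w_hidden, w_out

def max_abs_gradient(x, y, w_hidden, w_out):
    grad_hidden, grad_out = gradients(x, y, w_hidden, w_out)
    flat = [g for row in grad_hidden for g in row] + grad_out
    return max(abs(g) for g in flat)

rng = random.Random(0)
x, y = [0.8, -0.3], 1.0  # hypothetical two-feature component labelled defective

zero_w = init_weights(2, 3, 0.0, rng)   # exactly zero starting weights
small_w = init_weights(2, 3, 0.1, rng)  # small random starting weights

zero_grad = max_abs_gradient(x, y, *zero_w)    # every derivative vanishes
small_grad = max_abs_gradient(x, y, *small_w)  # training can make progress
```

With all weights at zero, every hidden activation is zero and every derivative vanishes, so a gradient-based learning rule would leave the weights unchanged. Small random weights break that symmetry while keeping tanh in its nearly linear range.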
doi:10.17706/jsw.11.9.883-902 fatcat:4xxeaepxdzgzjpevqmh7yzih64