Controlling Overfitting in Classification-Tree Models of Software Quality
Source: Empirical Software Engineering, Volume 6, Number 1, March 2001 , pp. 59-79(21)
Abstract:Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm which is a refinement of the CHAID algorithm to build classification-tree models. Chaid-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.
Document Type: Regular Paper
Publication date: 2001-03-01