An empirical study of regression test application frequency

Jung-Min Kim, Adam Porter, Gregg Rothermel
2000 Proceedings of the 22nd international conference on Software engineering - ICSE '00  
Regression testing is an expensive maintenance process used to revalidate modified software. Regression test selection (RTS) techniques attempt to reduce the cost of regression testing by selecting and running a subset of an existing test suite. Many RTS techniques have been proposed in the research literature, and studies have shown that they can produce savings. Other studies have shown that the cost-effectiveness of RTS techniques can vary widely with various characteristics of the workloads (programs, versions, and test suites) to which they are applied. It seems plausible, however, that another set of factors impacting the cost-effectiveness of RTS techniques involves the process by which they are applied. In particular, issues such as the frequency with which regression testing is done have a strong effect on the behavior of RTS techniques. Therefore, in earlier work an experiment was conducted to assess the effects of test application frequency on the cost-effectiveness of RTS techniques. The results exposed essential tradeoffs that should be considered when using these techniques over a series of software releases. This work, however, was limited by several threats to external validity; in particular, the subject programs utilized were relatively small. Therefore, in this work, the previous experiment has been replicated on a large, multi-version program. This second experiment largely confirms the initial findings of the first study. In particular, results indicate that the cost of using safe RTS techniques was strongly and negatively affected by testing interval; that is, as the number of changes made to the program since the previous testing session increased, the number of test cases selected rose rapidly. Conversely, results show that the effectiveness of minimization RTS techniques was strongly and positively affected; that is, as the number of changes increased, so did the effectiveness of the test suites selected by minimization.

INTRODUCTION

After modifying software, developers typically want to know that existing system functionality has not been adversely affected. To obtain such knowledge, developers often perform regression testing. The simplest regression testing strategy is to rerun all existing test cases. This strategy is easy to implement, but can be unnecessarily expensive, especially when changes affect only a small part of the system.
Consequently, an alternative approach, regression test selection (RTS), has been extensively investigated [e.g., 1, 2, 5, 6, 10, 14]. With this approach only a subset of the test cases contained in a test suite is selected and rerun. Reducing the number of test cases rerun reduces regression testing costs, but may also cause fault-revealing test cases to be omitted. Since, in general, optimal test selection (i.e., selecting exactly the fault-revealing test cases) is impossible [13], the cost-benefit tradeoffs of RTS techniques are a central concern of regression testing research and practice.

A common way to empirically study this problem has been to find or create base and modified versions of a system and accompanying test suites. Next, one or more RTS techniques are run and the size and effectiveness of the selected test suites are compared to the size and effectiveness of the original test suite [e.g., 4, 11, 12, 16, 17]. Empirical studies of this sort have revealed several cost-effectiveness tradeoffs between RTS techniques, but they have also revealed that the performance of RTS techniques can vary widely with characteristics of programs, modifications, and test suites [16]. Recent studies [18, 22] have thus attempted to empirically evaluate some of these sources of differences.

One limitation of all this prior empirical work is that it fails to account for differences in testing processes, which may affect the cost-effectiveness of RTS techniques. In particular, previous studies of RTS techniques have all modeled regression testing as a one-time activity. In practice, however, regression testing is often a continuous process. For example, software releases often require many changes to a system with regression testing sessions interspersed between various numbers of changes rather than performed just once prior to product release.
Similarly, many companies integrate system components and then regression test software changes on a monthly, weekly, or even daily basis.

One can hypothesize that the amount of change made between regression testing sessions strongly affects the costs and benefits of different RTS techniques. That is, it is possible that some RTS techniques will perform less cost-effectively as the amount of change made between regression testing sessions grows. This is because they will select increasingly large test suites and because these suites will become increasingly less cost-effective at finding faults. If this hypothesis is correct, testing practitioners may be able to better manage and coordinate their integration and regression testing processes by altering them to accommodate the effects of change size; this in turn may result in savings in time and money.

To test this hypothesis, in earlier work [20] an experiment was conducted to assess the effects of test application frequency on the costs and benefits of RTS techniques. The results exposed essential tradeoffs that should be considered when using these techniques over a series of software releases. This work, however, was limited by several threats to external validity; in particular, the subject programs utilized were relatively small. Therefore, in this work, the previous experiment is replicated on a large, multi-version program. This second experiment largely confirms the findings of the first. In particular, the cost of using safe RTS techniques was strongly and negatively affected by testing interval; that is, as the number of changes made to the program since the previous testing session increased, the number of test cases selected rose rapidly. Conversely, the effectiveness of minimization RTS techniques was strongly and positively affected; that is, as the number of changes increased, so did the effectiveness of the test suites selected by minimization.
The remainder of this paper reviews the relevant background material and literature, and then presents the design and analysis of both the initial experiments and the replication of those experiments. Finally, the results of the two experiments are compared, and then conclusions and future directions for research are presented.

BACKGROUND AND LITERATURE REVIEW

Regression Testing

Let P be a procedure or program, let P′ be a modified version of P, and let T be a test suite for P. A typical regression test proceeds as follows:

1. Select T′ ⊆ T, a set of test cases to execute on P′.
2. Test P′ with T′, establishing P′'s correctness with respect to T′.
3. If necessary, create T″, a set of new functional or structural test cases for P′.
4. Test P′ with T″, establishing P′'s correctness with respect to T″.
5. Create T‴, a new test suite and test history for P′, from T, T′, and T″.

Each of these steps involves important problems. However, this work concerns only step 1, the regression test selection problem.

Regression Test Selection Techniques

Several RTS techniques have been investigated in the research literature (see [13]). Here several classes of techniques are briefly described, and a representative example of each is presented.

Retest-All. This approach reruns all test cases in T. It may be used when test effectiveness is the utmost priority with little regard for cost.

Random/Ad-Hoc. Testers often select test cases randomly or rely on their prior knowledge or experience. One such technique is to randomly select a percentage of test cases from T.

Minimization. These approaches (e.g., [3, 6]) aim to select a minimal set of test cases from T that covers all modified or affected elements of P′. One such technique randomly selects test cases from T until every statement added to or modified in creating P′ is exercised by at least one test case.

Safe. These approaches (e.g., [2, 14]) select, under certain conditions, every test case in T that covers changed program entities in P′. One such technique [14] selects every test case in T that exercises at least one statement that was added to or modified in creating P′, or that has been deleted from P.

Cost and Benefit Models

Leung and White [8] present a model of the costs and benefits of RTS strategies. Costs are divided into two types: direct and indirect. Indirect costs include management overhead, database maintenance, and tool development. Direct costs include costs of test selection, test execution, and results analysis. Savings are simply the costs avoided by not running unselected test cases.

Let T′ be the subset of T selected by a certain RTS technique M for program P, and let |T′| denote the cardinality of T′. Let s be the average cost per test case of applying M to P to select T′, and let r be the average cost per test case of running P on a test case in T and checking its result. Leung and White argue that for RTS to be cost-effective the inequality

    s|T′| < r(|T| − |T′|)

must hold. That is, the analysis required to select T′ should cost less than the cost of running the unselected test cases, T − T′.

One limitation of this model is that it overlooks the cost of undetected faults. Since a primary purpose of testing is to detect faults, it is important to understand whether, and to what extent, test selection reduces fault detection effectiveness. To address this limitation, Malishevsky et al. [21] extend Leung and White's model to factor in benefits related to fault detection effectiveness.

Previous Empirical Studies

Initially, cost-effectiveness, as defined by Leung and White, was the central focus of regression test selection studies. Rosenblum and Weyuker [12] applied the technique TestTube to 31 versions of the KornShell and its test suites. For 80% of the versions, their method selected all existing test cases.
They note that the test suite is relatively small (16 test cases), and that many of the test cases exercise all the components of the system.
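
The four classes of selection technique and the Leung and White inequality described in the background above can be illustrated with a small sketch. It is not the implementation used in the study: it assumes that coverage is available as a map from test-case names to the statement IDs each test exercises, and that a change is summarized as a set of affected statement IDs. The data (COVERAGE, CHANGED), the function names, and the cost values are all hypothetical. The safe and minimization functions are simplified statement-level versions, and the minimization function keeps a randomly drawn test only when it covers a changed statement that is still uncovered, which is one possible reading of the technique.

```python
import random

# Hypothetical coverage data: test case -> statement IDs it exercises.
COVERAGE = {
    "t1": {1, 2, 3},
    "t2": {2, 4},
    "t3": {5},
    "t4": {3, 4, 6},
    "t5": {1, 6},
}
# Hypothetical set of statements added, modified, or deleted by a change.
CHANGED = {4, 6}


def retest_all(coverage):
    """Retest-all: rerun every test case in T."""
    return set(coverage)


def random_selection(coverage, fraction, rng):
    """Random: pick a fixed percentage of T uniformly at random."""
    tests = sorted(coverage)
    return set(rng.sample(tests, round(fraction * len(tests))))


def safe_selection(coverage, changed):
    """Safe (simplified): every test that exercises a changed statement."""
    return {t for t, stmts in coverage.items() if stmts & changed}


def minimization_selection(coverage, changed, rng):
    """Minimization (simplified): visit tests in random order and keep a
    test only if it exercises a changed statement not yet covered."""
    selected, uncovered = set(), set(changed)
    order = sorted(coverage)
    rng.shuffle(order)
    for t in order:
        if not uncovered:
            break
        if coverage[t] & uncovered:
            selected.add(t)
            uncovered -= coverage[t]
    return selected


def leung_white_cost_effective(s, r, total, selected):
    """Leung and White's condition: s|T'| < r(|T| - |T'|)."""
    return s * selected < r * (total - selected)


safe = safe_selection(COVERAGE, CHANGED)
print(sorted(safe))  # ['t2', 't4', 't5']
# With assumed costs s = 0.1 and r = 1.0: 0.1 * 3 < 1.0 * (5 - 3)
print(leung_white_cost_effective(0.1, 1.0, len(COVERAGE), len(safe)))  # True
```

In this toy setting, widening CHANGED to model a longer testing interval makes safe_selection return more of T, so the right-hand side of the inequality shrinks. This matches the direction of the cost effect reported for safe techniques.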
doi:10.1145/337180.337196 dblp:conf/icse/KimPR00 fatcat:qivweti2bzczdidegaaxgnrjxi