On the number and nature of faults found by random testing

I. Ciupa, A. Pretschner, M. Oriol, A. Leitner, B. Meyer
2011, Software Testing, Verification and Reliability
Intuition suggests that random testing should exhibit a considerable difference in the number of faults detected by two different runs of equal duration. As a consequence, random testing would be rather unpredictable. This article first evaluates the variance over time of the number of faults detected by randomly testing object-oriented software that is equipped with contracts. It presents the results of an empirical study based on 1215 h of randomly testing 27 Eiffel classes, each with 30 seeds of the random number generator. The analysis of over 6 million failures triggered during the experiments shows that the relative number of faults detected by random testing over time is predictable, but that different runs of the random test case generator detect different faults. The experiments also suggest that random testing finds faults quickly: the first failure is likely to be triggered within 30 s. The second part of this article evaluates the nature of the faults found by random testing. To this end, it first explains a fault classification scheme, which is also used to compare the faults found through random testing with those found through manual testing and with those found in field use of the software and recorded in user incident reports. The results of the comparisons show that each technique is good at uncovering different kinds of faults. None of the techniques subsumes any of the others; each brings distinct contributions. This supports a more general conclusion on comparisons between testing strategies: the number of detected faults is too coarse a criterion for such comparisons; the nature of the faults must also be considered.

One way to assess a testing strategy is in absolute terms: how many faults are detected? If a strategy is applied until it has finished generating tests [1], how much does it cost to generate and execute these tests? A second way is to perform relative assessments by comparing one strategy with others. Many researchers have followed the latter path, as witnessed by the large (and here necessarily incompletely cited) body of work [2-11]. Some studies have provided analytical answers such as subsumption relationships; others have focused on the number of faults experimentally detected by the different strategies. In sum, there is almost no conclusive evidence that one testing strategy clearly outperforms another in terms of the number of detected faults.

Arguably, the least that one would expect from a testing strategy is that it performs better than random testing. Intuitively, random testing is the simplest form of generating tests and (seemingly) does not require much intellectual or computational effort. Indeed, the random generation of test input data is attractive because it is widely applicable and cheap, both in terms of implementation effort and execution time. Yet, in addition to the input data, test cases also contain an expected-output part. Because it depends on a specific input, the expected output cannot be generated at random. However, it can be provided at different levels of abstraction [12]. One extreme possibility is to specify the expected output as abstractly as 'no exception is thrown.' In an admittedly rough manner, this solves the oracle problem by reducing testing to robustness testing: random test case generation boils down to picking elements from the input domain and adding 'no exception' as the expected output.

Testing object-oriented programs is slightly more challenging, because the input domain may consist of arbitrarily complex objects: picking random elements from the set of integers is obviously simpler than generating arbitrary electronic health records. Most routines (methods) defined for a health record are likely to be applicable only if the record exhibits certain characteristics; for instance, comparing two diagnoses at least requires that both diagnoses exist. As a consequence, generating objects to use as input to a routine is a non-trivial task.
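To make this concrete, the following is a minimal sketch in Python (the paper's subject language is Eiffel); the HealthRecord class, its fields, and the 50% chance of a missing diagnosis are illustrative assumptions, not taken from the study. It only shows why naively generated random objects often fail to satisfy a routine's precondition and therefore cannot serve as test inputs for that routine.

```python
import random

class HealthRecord:
    """Hypothetical record type: comparing diagnoses is only meaningful
    (i.e. the routine's precondition holds) when both diagnoses exist."""

    def __init__(self, primary_diagnosis, secondary_diagnosis):
        self.primary_diagnosis = primary_diagnosis      # may be None
        self.secondary_diagnosis = secondary_diagnosis  # may be None

    def same_diagnosis(self):
        # Precondition: both diagnoses must exist.
        assert self.primary_diagnosis is not None
        assert self.secondary_diagnosis is not None
        return self.primary_diagnosis == self.secondary_diagnosis


def random_record():
    """Naive random generation: each diagnosis is absent half of the time."""
    def maybe_code():
        return None if random.random() < 0.5 else random.choice(["A01", "B20", "C34"])
    return HealthRecord(maybe_code(), maybe_code())


applicable = 0
for _ in range(10_000):
    record = random_record()
    if record.primary_diagnosis is not None and record.secondary_diagnosis is not None:
        applicable += 1   # only these records are valid inputs for same_diagnosis

# Roughly a quarter of the naively generated records can exercise the routine.
print("applicable inputs:", applicable, "out of 10000")
```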
This article analyzes one particular flavour of random testing for object-oriented software, both in absolute and in relative terms. In a first step, it analyzes several characteristics of random testing itself, namely its predictability in terms of both the number and the nature of detected faults, the average time it takes to detect a first failure, and the distribution of detected faults. These questions are motivated by the intuition that two distinct runs of a random test case generator (technically speaking, runs with two different seeds for the random number generator) should yield different results. It turns out that, in the experiments, random testing is likely to detect the first failure within 30 s; that it is highly predictable in terms of the relative number of detected faults; but that two different runs of a random testing tool are likely to reveal different faults.

From an engineer's point of view, a high variance of random testing means low predictability of the process, which immediately reduces its value. One might argue that random testing can be performed overnight and whenever spare processor cycles are available; the sheer amount of continuous testing would then compensate for any potential variance. However, arbitrary computation resources may not be available, and insights into the efficiency of a testing strategy are useful from the management perspective: such numbers make it comparable with other strategies.

The finding that different runs of the random testing tool find approximately the same number of faults, yet different faults, suggests that the number of faults is too coarse a criterion for comparing testing strategies. Consequently, a second part of the article compares the nature of the faults revealed by random testing with the nature of the faults revealed by manual testing and by user incident reports. In these experiments, random testing turns out to be neither better nor worse than the other strategies. Each strategy finds different faults, which suggests that random testing should be used as a complement to these strategies rather than as a competing technology.

More concretely, in terms of the number of faults detected, the article first examines how similar the results of different test sessions of the same duration but with different seeds of the random number generator are; it then addresses the issue of the predictability of random testing. Second, it sheds light on the nature of the faults that random testing finds, in particular when compared with manual testing and with user incident reports.

The experiments presented here evaluate Eiffel programs. One distinctive feature of Eiffel programs is that they contain embedded executable specifications in the form of contracts§. Routine postconditions are one kind of contract; they naturally lend themselves to being used as oracles, with a level of abstraction somewhere in between the concrete output and the abstract absence of exceptions [12]. Randomly generating test cases for Eiffel programs hence consists of (1) generating input objects for a routine to be tested and (2) adding the routine's postcondition as the expected output. The experiments use the AutoTest tool [14] to investigate the performance of random testing. AutoTest performs fully automated testing of contracted Eiffel programs: it calls the routines of the classes under test with randomly generated inputs (objects) and, if the preconditions of these routines are satisfied, checks whether their contracts are fulfilled while the tests run. Any contract violation that occurs, or any other exception thrown, signals a fault.
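The following is a minimal sketch, in Python, of this kind of test loop. The Account class, its contracts (written here as explicit checks), and the seeded bug are illustrative assumptions; this is not AutoTest itself, which works directly on contracted Eiffel code where the precondition filter and the postcondition oracle come for free from the contracts.

```python
import random

class PreconditionViolation(Exception): pass
class PostconditionViolation(Exception): pass

class Account:
    """Illustrative class under test, with its contracts written as explicit checks."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        if amount <= 0:                                 # precondition
            raise PreconditionViolation("amount must be positive")
        old_balance = self.balance
        # Seeded bug for illustration: large deposits are silently truncated.
        self.balance = old_balance + min(amount, 500)
        if self.balance != old_balance + amount:        # postcondition used as oracle
            raise PostconditionViolation("balance must grow by exactly amount")

failures = []
for i in range(1_000):
    target = Account()
    amount = random.randint(-1_000, 1_000)              # randomly generated input
    try:
        target.deposit(amount)
    except PreconditionViolation:
        continue              # input not applicable: the call is not a valid test case
    except Exception as exc:  # postcondition violation or any other exception ...
        failures.append((i, amount, exc))                # ... signals a fault

print(len(failures), "failing test cases out of 1000 generated calls")
```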
AutoTest's strategy for creating inputs is not purely random: random generation is combined with limit testing, as explained in Section 2. Previous experiments [15] have shown that this strategy is much more effective at uncovering faults than purely random testing, at no extra cost in terms of execution time; it is thus more relevant to investigate the more effective strategy. As this strategy is not purely random but also uses special predefined values (which have a high impact on the results), the rest of this article refers to it as random+ testing.

1.1.1. Predictability of random+ testing.

The experiment for investigating the predictability of random+ testing consisted of generating and running tests with AutoTest for 27 classes from a widely used Eiffel library, which was not modified in any way. Each class was tested for 90 min. To assess the predictability of the process, the testing sessions were run for each class 30 times, each time with a different seed for the pseudo-random number generator. The main results are the following:

§ The widely spread view that developers do not see the advantages of contracts and will not go through the trouble of writing them is contradicted by a broad empirical study [13], which shows that programmers do write contracts, even if not complete ones.
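The measurement behind this setup can be sketched in Python as follows. The routine, its two seeded faults, the particular special values, and the 25% mixing probability are all illustrative assumptions (the paper's actual random+ strategy is described in its Section 2); the sketch only shows how repeated sessions with different seeds each yield a set of detected faults, which can then be compared across sessions.

```python
import random

SPECIAL_INTS = [0, 1, -1, 2**31 - 1, -2**31]     # assumed predefined limit values

def random_plus_int(rng):
    """random+ style input: sometimes a predefined special value, otherwise random."""
    if rng.random() < 0.25:
        return rng.choice(SPECIAL_INTS)
    return rng.randint(-10_000, 10_000)

def routine_under_test(x):
    """Illustrative routine containing two seeded faults, labelled A and B."""
    if x == 0:
        raise ValueError("fault A: zero input not handled")
    if x > 9_900:
        raise OverflowError("fault B: very large input not handled")
    return 1 / x

def session(seed, test_count=20):
    """One testing session: returns the set of distinct faults it detected."""
    rng = random.Random(seed)
    detected = set()
    for _ in range(test_count):
        try:
            routine_under_test(random_plus_int(rng))
        except Exception as exc:
            detected.add(str(exc))    # label of the fault behind this failure
    return detected

# 30 sessions differing only in their seed; comparing the resulting sets shows
# which faults are found by every session and which only by some seeds.
results = {seed: session(seed) for seed in range(30)}
print({seed: sorted(faults) for seed, faults in results.items()})
```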
doi:10.1002/stvr.415