The Challenge of Test Data Quality in Data Processing

Christoph Becker, Kresimir Duretec, Andreas Rauber
2017 Journal of Data and Information Quality  
Models of test data quality are needed to systematically evaluate the fitness for purpose of individual test data sets, identify concrete shortcomings, and effectively combine data from different sources. The metrics must at least address a data set's test coverage in relation to identified tasks; the presence and reliability of the test oracle; and the degree to which the data set approximates real-world collections. Test data adequacy is a long-recognized concern in software engineering, but no comprehensive quality models address the concerns of data processing.
3. CONCLUSIONS
The need for robust test data sets for data processing presents challenging research questions in data and information quality. Adequate ground truth must accompany test data to provide the test oracle. Novel approaches to model-based testing use model-driven engineering technologies to synthesize test data and oracles. These seeds of the emergent area of model-driven test data generation for complex data processing tasks present a promising alternative to the prevailing approach of sampling and annotation. Robust quality models for test data sets are needed to evaluate emerging approaches and to allow the systematic development of heuristics that combine sampled, annotated data with synthetically generated data.
doi:10.1145/3012004 dblp:journals/jdiq/BeckerDR17 fatcat:rq4jxzkpyjh3lla4mfeceoirla