Test Case Prioritization for Black Box Testing

Bo Qu, Changhai Nie, Baowen Xu, Xiaofang Zhang
Proceedings of the IEEE International Computer Software and Applications Conference (COMPSAC), 2007
The central aim in this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the "RelATive cEntrality" (RATE) measure to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem
from significant covarying relationships with other variants in the data. We illustrate RATE through Bayesian Gaussian process regression, but the methodological innovations apply to other "black box" methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for phenotypes generated by complex genetic architectures. With detailed simulations and two real data association mapping studies, we show that applying RATE enables an explanation for this improved performance.

Beyond this explanation for improvement gains, we often wish to know precisely which variables are the most important, with the ultimate goals of furthering scientific understanding and performing model/feature selection [Barbieri and Berger (2004)]. As our main contribution we propose a "RelATive cEntrality" (RATE) measure for investigating variable importance in Bayesian nonlinear models, particularly those considered to be black box. Here, RATE identifies variables which are not just marginally important, but also those whose data associations stem from a significant covarying relationship with other variables. Our method is entirely general with respect to the modeling approach taken; the only requirement is that the method can produce uncertainty intervals for predictions. As an illustration, we focus on Gaussian process modeling with Markov chain Monte Carlo (MCMC) inference. In addition, we note that this variable selection approach immediately applies to other methodologies such as Bayesian neural networks [Richard and Lippmann (1991)], Bayesian additive regression trees [Chipman, George and McCulloch (2010)], and approximate inference methods like variational Bayes [Rasmussen and Williams (2006)]. While variable selection is the main utility of our method, we are motivated by the approach of continuous model expansion [Gelman, Hwang and Vehtari (2014)].
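The idea of a relative-centrality score can be made concrete with a small numerical sketch. The construction below is illustrative, not the paper's exact definition: we assume a multivariate normal approximation to posterior effect-size draws (as might be projected from a Gaussian process fit via MCMC), measure each variable's importance by the KL divergence between the conditional posterior of the remaining effects given that variable's effect is zeroed and their marginal posterior, and normalize the scores to sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated posterior draws of p effect sizes, standing in for draws from
# a fitted nonlinear model. Sizes and the zeroing construction below are
# illustrative assumptions, not taken from the paper.
p, n_draws = 5, 5000
cov = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
mean = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
beta = rng.multivariate_normal(mean, cov, size=n_draws)

mu = beta.mean(axis=0)
Sigma = np.cov(beta, rowvar=False)

def gauss_kl(m1, S1, m0, S0):
    """KL(N(m1, S1) || N(m0, S0)) in closed form."""
    k = len(m1)
    S0_inv = np.linalg.inv(S0)
    d = m1 - m0
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S0_inv @ S1) - k + d @ S0_inv @ d + ld0 - ld1)

def centrality(mu, Sigma, j):
    """KL divergence between the conditional posterior of the remaining
    effects given beta_j = 0 and their marginal posterior."""
    idx = [i for i in range(len(mu)) if i != j]
    s = Sigma[idx, j]
    m_cond = mu[idx] - s * mu[j] / Sigma[j, j]
    S_cond = Sigma[np.ix_(idx, idx)] - np.outer(s, s) / Sigma[j, j]
    return gauss_kl(m_cond, S_cond, mu[idx], Sigma[np.ix_(idx, idx)])

kld = np.array([centrality(mu, Sigma, j) for j in range(p)])
rate = kld / kld.sum()  # normalized scores summing to one
print(np.round(rate, 3))
```

Because variable 0 has a large mean effect and covaries with every other variable, zeroing it shifts the conditional posterior of the rest the most, so its normalized score dominates; the marginally null but correlated variables receive small, roughly equal scores.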
The goal is to build the best-fitting or optimally predictive model while searching over many variables and the interactions between them, but without explicitly worrying about sparsity. Indeed, this has become a recent focus of statistical methods research, especially in terms of understanding the relative importance of subsets of candidate predictors with respect to specific predictive goals [Lin, Chan and West (2016)]. While we believe strongly in regularization as a key ingredient in developing good statistical models, our choice of Gaussian process priors achieves robust inference without explicitly imposing a sparsity penalty. The reason to avoid sparsity constraints like the lasso is not just philosophical: as typically applied, L1-regularization suffers from a lack of stability [Lim and Yu (2016), Piironen and Vehtari (2017)], and the use of Laplacian priors has likewise been criticized [Carvalho, Polson and Scott (2010)]. Simultaneously, we are also motivated by the rise of deep neural networks, which are typically wildly overparameterized and yet, when combined with large datasets, can give quite impressive improvements in model performance.

We assess our proposed approach in the context of association mapping (i.e., inference of significant variants or loci) in statistical genetics as a way to highlight data science applications that are driven by many covarying and interacting predictors. For example, understanding how statistical epistasis between genes (i.e., the polynomial terms of the variables in the genotype matrix) influences the architecture of traits and variation in phenotypes is of great interest in genetics. However, despite studies that have detected "pervasive epistasis" in many model organisms [Horn et al. (2011)] and improved genomic selection (i.e., phenotypic prediction) using nonlinear regression models
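The reading of statistical epistasis as polynomial terms of the genotype matrix can be sketched concretely. The example below is illustrative (the matrix sizes and coding are assumptions, not from the paper): given a genotype matrix of minor-allele counts, pairwise epistatic terms are simply the degree-2 products of its columns.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Toy genotype matrix: n individuals x p variants, coded {0, 1, 2}
# as minor-allele counts. Dimensions here are illustrative only.
n, p = 6, 4
X = rng.integers(0, 3, size=(n, p))

# Pairwise "statistical epistasis" terms: products of genotype columns,
# i.e., the degree-2 polynomial features of X.
pairs = list(combinations(range(p), 2))
X_epi = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

print(X.shape, X_epi.shape)  # p*(p-1)/2 interaction columns
```

A regression fit on the concatenation of X and X_epi can then capture pairwise interaction effects, which is one reason nonlinear or expanded models can outperform purely additive ones on complex traits.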
doi:10.1109/compsac.2007.209 dblp:conf/compsac/QuNXZ07