A parsimonious threshold-independent protein feature selection method through the area under receiver operating characteristic curve

Z. Wang, Y.-c. I. Chang, Z. Ying, L. Zhu, Y. Yang
2007 Bioinformatics  
Motivation: Protein expression profiling for differences indicative of early cancer holds promise for improving diagnostics. Because of its high dimensionality, proteomic data from mass spectrometers is challenging to analyze statistically in many respects, including dimension reduction, feature subset selection, and the construction of classification rules. The search for an optimal feature subset, commonly known as the feature subset selection (FSS) problem, is an important step towards disease classification/diagnostics with biomarkers.
Methods: We develop a parsimonious threshold-independent feature selection (PTIFS) method based on the area under the receiver operating characteristic (ROC) curve (AUC). To reduce computational complexity to a manageable level, we use a sigmoid approximation to the empirical AUC as the criterion function. Starting from an anchor feature, the PTIFS method selects a feature subset through an iterative updating algorithm. Highly correlated features that have similar discriminating power are precluded from being selected simultaneously. The classification rule is then determined from the resulting feature subset.
Results: The performance of the proposed approach is investigated through extensive simulation studies and by applying the method to two mass spectrometry data sets, one of prostate cancer and one of liver cancer. We compare the new approach with the threshold gradient descent regularization (TGDR) method. The results show that our method achieves disease classification performance comparable to that of the TGDR method, but with fewer features selected.
Availability: Supplementary Material and the PTIFS implementations are available at
doi:10.1093/bioinformatics/btm442 pmid:17878205 fatcat:vc4y5ztaf5hy3jzjiwf4fxjteq
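
The sigmoid approximation to the empirical AUC mentioned under Methods can be illustrated with a minimal sketch. This is not the authors' PTIFS implementation: the function names, the bandwidth parameter `sigma`, and the linear scoring rule are assumptions made for illustration only. The empirical AUC counts correctly ordered (case, control) pairs with a step function; replacing the step with a sigmoid gives a smooth criterion that can be optimized by gradient-type updates.

```python
import math


def empirical_auc(pos, neg):
    """Mann-Whitney estimate of the AUC: the fraction of (case, control)
    score pairs that are ordered correctly, with ties counted as 1/2."""
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def smoothed_auc(pos, neg, sigma=0.1):
    """Sigmoid-smoothed AUC: the indicator 1{p > n} is replaced by
    sigmoid((p - n) / sigma). `sigma` is a hypothetical bandwidth;
    as sigma shrinks, the value approaches the empirical AUC."""
    total = sum(sigmoid((p - n) / sigma) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


def linear_scores(rows, weights):
    """Score each sample by a weighted sum of its features (an assumed
    scoring rule; a zero weight means the feature is not selected)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


# Toy data: two features per sample; only the first feature is selected.
cases = [[2.0, 0.1], [1.5, 0.3], [0.4, 0.2]]
controls = [[0.5, 0.2], [1.0, 0.4], [0.1, 0.0]]
weights = [1.0, 0.0]

pos = linear_scores(cases, weights)
neg = linear_scores(controls, weights)
print(empirical_auc(pos, neg))             # 7 of 9 pairs ordered correctly
print(smoothed_auc(pos, neg, sigma=0.01))  # close to the empirical value
```

A smaller `sigma` tracks the empirical AUC more closely but makes the criterion steeper and harder to optimize, so in practice the bandwidth trades approximation accuracy against smoothness.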