Using Machine Learning to Facilitate Classification of Somatic Variants from Next-Generation Sequencing [article]

Chao Wu, Xiaonan Zhao, Mark Welsh, Kellianne Costello, Kajia Cao, Ahmad Abou Tayoun, Marilyn Li, Mahdi Sarmady
2019 bioRxiv   pre-print
AbstractBackgroundMolecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. Here we present a machine learning-based method to distinguish artifacts from bona fide Single Nucleotide Variants (SNVs) detected by NGS from tumor specimens.MethodsA
more » ... rt of 11,278 SNVs identified through clinical sequencing of tumor specimens were collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of clinical laboratory workflow. A three-class (real, artifact and uncertain) model was developed on the training set, fine-tuned using the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label "uncertain" variants.ResultsThe optimized classifier demonstrated 100% specificity and 97% sensitivity over 5,587 SNVs of the test set. 1,252 out of 1,341 true positive variants were identified as real, 4,143 out of 4,246 false positive calls were deemed artifacts, while only 192(3.4%) SNVs were labeled as "uncertain" with zero misclassification between the true positives and artifacts in the test set.ConclusionsWe presented a computational classifier to identify variant artifacts detected from tumor sequencing. Overall, 96.6% of the SNVs received a definitive label and thus were exempt from manual review. This framework could improve quality and efficiency of variant review process in clinical labs.
doi:10.1101/670687 fatcat:7agyw6pyb5fnjirvmoltdfblo4