Tumor Phylogeny Topology Inference via Deep Learning [article]

Erfan Sadeqi Azer, Mohammad Haghir Ebrahimabadi, Salem Malikić, Roni Khardon, S. Cenk Sahinalp
2020 bioRxiv pre-print
Abstract

Motivation: Principled computational approaches for tumor phylogeny reconstruction via single-cell sequencing (SCS) typically aim to identify the most likely perfect phylogeny tree through combinatorial optimization or Bayesian inference. Because of the limitations of SCS technologies, such as frequent allele dropout and variable sequence coverage, a noise reduction/elimination process may become necessary to infer a tumor phylogeny. Such noise reduction processes may aim to correct for ... e most likely/parsimonious set of false negative/false positive variant calls so as to construct a perfect phylogeny. Since these problems are NP-hard, available principled approaches for tumor phylogeny reconstruction are limited in their ability to scale up to emerging SCS datasets. In fact, even when the goal is to infer basic topological features of the tumor phylogeny rather than to reconstruct it entirely, available techniques may be prohibitively slow. As a result, fast techniques are highly desirable for deducing, e.g., (i) whether the most likely tree has a linear (chain) or branching topology, or (ii) whether a perfect phylogeny is feasible from a single-cell genotype matrix, without explicitly testing for the three-gametes rule.

Results: In this paper we introduce deep-learning solutions to the above-mentioned problems for studying tumor evolution from SCS data. After training with sufficiently many examples: (1) our fully connected neural network for differentiating linear vs. branching topologies can improve on the running time of the fastest combinatorial tumor phylogeny reconstruction methods by a factor of ≥ 1000, while achieving an accuracy of ∼ 98% on simulated data including 100 cells and 100 mutations with realistic noise levels (leading to mostly false negatives) of 10–15%; (2) similarly, our fully connected neural network for checking whether the input data permits a perfect phylogeny achieves an accuracy of ∼ 90% on simulated data including 10 cells and 10 mutations, with similar noise levels; (3) finally, our reinforcement learning approach for tumor phylogeny reconstruction can actually eliminate noise and obtain the perfect phylogeny (PP) when the false negative/false positive rate is ≤ 2%, for a large fraction of evaluation data sets with varying numbers of cells and mutations, even when trained with fixed-size data sets of only 10 cells and 10 mutations. This may be useful for future clinical applications that would employ emerging SCS technologies with lower noise levels.

Availability: https://github.com/algo-cancer/PhyloM

Contact: cenk.sahinalp@nih.gov
doi:10.1101/2020.02.07.938852 fatcat:ngd6ndp4wrcxthleaf2dkceayi
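
Editorial illustration (not from the paper or its repository): the abstract refers to the three-gametes rule and to the linear vs. branching distinction without spelling them out. The sketch below shows the explicit combinatorial checks on a noise-free binary genotype matrix (rows are cells, columns are mutations), i.e. the kind of test the authors' neural networks are meant to avoid on noisy data. The function names and the list-of-lists matrix layout are assumptions made for this example only.

```python
from itertools import combinations


def gamete_patterns(matrix, i, j):
    """Collect the (column i, column j) value pairs seen across all cells."""
    return {(row[i], row[j]) for row in matrix}


def violates_three_gametes(matrix):
    """True if some pair of mutations shows all of (0,1), (1,0) and (1,1).

    A binary matrix admits a perfect phylogeny (with an unmutated root)
    exactly when no such pair of columns exists.
    """
    if not matrix:
        return False
    n_mutations = len(matrix[0])
    conflict = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_mutations), 2):
        if conflict <= gamete_patterns(matrix, i, j):
            return True
    return False


def is_linear_topology(matrix):
    """For a conflict-free, noise-free matrix: True if the mutations form a chain.

    Assumption of this sketch: the tree is linear when, for every pair of
    mutations, the cells carrying one are a subset of the cells carrying
    the other, so (0,1) and (1,0) never both occur for the same pair.
    """
    if violates_three_gametes(matrix):
        raise ValueError("matrix does not admit a perfect phylogeny")
    n_mutations = len(matrix[0]) if matrix else 0
    for i, j in combinations(range(n_mutations), 2):
        seen = gamete_patterns(matrix, i, j)
        if (0, 1) in seen and (1, 0) in seen:
            return False
    return True


# Toy inputs (invented for illustration, not data from the paper).
chain = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]      # mutations nested: linear
branching = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]  # two disjoint subclones
conflicting = [[1, 0], [0, 1], [1, 1]]         # all three gametes present
```

On these toy inputs, `violates_three_gametes(conflicting)` is true, while `chain` and `branching` both pass the check and differ only in `is_linear_topology`. Both checks compare every pair of columns and assume error-free calls; with the false negative and false positive rates described in the abstract, a single dropout can create or hide a conflict, which is why noise correction or a learned classifier is needed in practice.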