Machine learning and large scale cancer omic data: decoding the biological mechanisms underpinning cancer [article]

Viola Fanfani, University Of Edinburgh, Giovanni Stracquadanio, Guido Sanguinetti
2022
Many of the mechanisms underpinning cancer risk and tumorigenesis are still not fully understood. However, the next-generation sequencing revolution and the rapid advances in big data analytics allow us to study cells and complex phenotypes at unprecedented depth and breadth. While experimental and clinical data are still fundamental to validate findings and confirm hypotheses, computational biology is key for the analysis of system- and population-level data for detection of hidden patterns
... the generation of testable hypotheses. In this work, I tackle two main questions regarding cancer risk and tumorigenesis that require novel computational methods for the analysis of system-level omic data. First, I focused on how frequent, low-penetrance inherited variants modulate cancer risk in the broader population. Genome-Wide Association Studies (GWAS) have shown that Single Nucleotide Polymorphisms (SNPs) contribute to cancer risk with multiple subtle effects, but these studies still fail to give further insight into the variants' synergistic effects. I developed a novel hierarchical Bayesian regression model, BAGHERA, to estimate heritability at the gene level from GWAS summary statistics. I then used BAGHERA to analyse data from 38 malignancies in the UK Biobank. I showed that genes with high heritable risk are involved in key processes associated with cancer and are often localised in genes that are somatically mutated drivers. Heritability analysis, like many other omics analysis methods, studies the effects of DNA variants on single genes in isolation. However, we know that most biological processes require the interplay of multiple genes and we often lack a broad perspective on them. For the second part of this thesis, I worked on the integration of Protein-Protein Interaction (PPI) graphs and omics data, which bridges this gap and recapitulates these interactions at a system level. First, I developed a modular and scalable Python package, PyGNA, that enables robust statistical testing of genesets' topological properties. PyG [...]
doi:10.7488/era/1915 fatcat:e7yl3yjbvfbqbdej3hkarli2u4