Integrative methods for analyzing big data in precision medicine

Vladimir Gligorijević, Noël Malod-Dognin, Nataša Pržulj
2016 Proteomics  
We provide an overview of recent developments in big data analyses in the context of precision medicine and health informatics. With the advance in technologies capturing molecular and medical data, we have entered the era of Big Data in biology and medicine. These data offer many opportunities to advance precision medicine. We outline key challenges in precision medicine and present recent advances in data integration-based methods to uncover personalized information from big data produced by
... us omics studies. We survey recent integrative methods for disease subtyping, bio-marker discovery and drug repurposing, and list the tools that are available to domain scientists. Given the ever-growing nature of these big data, we highlight key issues that big data integration methods will face.

1 https://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal
2 https://www.whitehouse.gov/precision-medicine

Sub-typing and Bio-marker discovery. Also known as patient stratification, sub-typing is the task of identifying sub-populations of patients that can be used to guide treatment procedures of a given individual belonging to the sub-population, and to predict the outcomes. Sub-typing identifies endotypes, which refer to sub-types in which patients are related by similarities in their underlying disease mechanisms (i.e., to explain the disease mechanisms) [19], and verotypes, which refer to true populations of similar patients for treatment purposes (i.e., to predict therapies for curing the patients) [20]. However, what precisely constitutes endotypes and verotypes, as well as how they should be discovered, remains open. Despite varying definitions, sub-typing remains a classification task and an active and growing area of machine learning research (see Section 3.1). Diseases such as cancer, autism, autoimmune diseases, cardiovascular diseases and Parkinson's have all been studied through the lens of subtyping [21-23]. According to the FDA, a bio-marker is any measurable diagnostic indicator that is used to assess the risk or presence of a disease [24]. Bio-marker discovery aims at finding features that are characteristic of particular patient sub-populations (e.g., specific gene mutations in tumour tissues, specific miRNAs, metabolites, etc.). The goal is that an individual need only be tested for bio-markers to decide whether she or he belongs to a specific patient sub-type.
Bio-markers are considered key to improving health-care and lowering medical costs [25].

Drug repurposing and personalised treatment. Drug repurposing refers to the identification and development of new uses for existing or abandoned pharmacotherapies. Capitalising on already known drugs reduces the cost of developing pharmacotherapies compared with de novo drug discovery and development [26]. With the availability of various omics data, computational prediction of new drug candidates for repurposing has necessitated the development of many new methods for data integration (see Section 3.2). Drug repurposing is not only about identifying new targets for known drugs; preclinical evaluations also include predicting therapeutic regimens (i.e., dose and frequency) and safety of the treatment (i.e., side effects). Bringing together patient sub-typing and precise prediction of therapeutic treatment outcomes is key for deriving personalised treatments. For example, the American Society of Clinical Oncology estimates that testing colon cancer patients for mutations in the K-RAS gene would save $604 million in drug costs annually: since patients with these mutations do not respond well to EGFR inhibitors, it is preferable to avoid giving them an ineffective and potentially toxic treatment, which is also very expensive ($100,000 per treatment) (see footnote 3).

In this paper, we give an overview of the available methods for analysing large and diverse biomedical data, introduce concepts of data integration and classification, and elaborate on the successes and limitations of Big Data approaches in precision medicine.

Big Data

Avalanche of Omics data

With the recent advances in biomedical data capturing technologies, omics sciences produce ever increasing amounts of biomedical data. We briefly present key available omics data types, which are illustrated in Figure 1.

Genomics and exomics. Genomics is a part of genetics that focuses on capturing whole genomes.
Historically, the Human Genome Project required 12 years and $3 billion to capture the first human genome, with a final release in 2003 reporting about 20,500 genes [9]. The first commercial next generation sequencer (NGS), the Roche GS-FLX 454 (released in 2004), allowed capturing the second human genome in two months [27]. In comparison, a modern NGS such as the Illumina HiSeq X is capable of producing up to 16 human genomes' worth of data per three-day run. Note that only 1-2% of a human's genetic material codes for genes, in DNA regions called exons. Exomics, which focuses on these smaller regions, leads to quicker and cheaper sequencing [28, 29]. Recently, the ability to perform sequencing of individual cells has provided novel insights into human biology and diseases [30, 31]. Heterogeneity in DNA sequence from one cell to another has unveiled the concept of mosaicism, i.e., the presence of two or more populations of cells with different genotypes in one individual [32]. Cancer in particular has been studied through the lens of genomic variation to find driver mutations.

Epigenomics. Epigenomics is the study of the complete set of epigenetic modifications of the genetic material of a cell. These reversible modifications of DNA or histones affect gene expression and thus play a major role in gene regulation. High-throughput methods, such as ChIP-seq and bisulfite sequencing, allow for detection of epigenetic modifications, such as DNA methylation, histone modification and chromatin structure [33, 34]. Epigenomics findings are cell-type specific and epigenetic reprogramming has a clear role in cancer [35, 36].

Transcriptomics. As opposed to DNA sequence, which is relatively static [37], RNA reflects the dynamic state of a cell. Transcriptomics aims at measuring the amount of transcribed genetic material over time. It includes both coding and non-coding RNAs, whose functions are sometimes unknown [38].
Co-expressed genes (i.e., with similar expression patterns over time)

3 http://www.asco.org/press-center/advances-treatment-gastrointestinal-cancers-0

Proteomics and interactomics. While transcriptomics considers all transcribed RNAs, proteomics focuses on the produced proteins, after all post-translational modifications (e.g., phosphorylation, glycosylation and lipidation). The human proteome is roughly two orders of magnitude larger than the set of human genes: because of alternative promoters, alternative splicing, and mRNA editing, the ≈25,000 human genes lead to ≈100,000 transcripts; with more than 300 different types of post-translational modifications, the number of resulting proteins is estimated to be larger than 1,800,000 [43]. High-throughput capture of protein sequences is done via mass spectrometry experiments [44]. Interactions amongst proteins, or between proteins and other molecules, are captured with high-throughput techniques, such as yeast two-hybrid [45] and affinity capture coupled with mass spectrometry [46]. Interactomes, and protein-protein interactions in particular, were successfully used to identify evolutionarily conserved pathways, complexes and functional orthologs [47-49].

Metabolomics, glycomics and fluxomics. A metabolite is any substance produced or consumed during metabolism (all chemical processes in a cell). Metabolomics studies all chemical processes involving metabolites [50]. Metabolic profiles are measured with mass spectrometry and nuclear magnetic resonance spectroscopy. Glycomics is the branch of metabolomics that studies glycomes, the sets of all sugars (free or in more complex molecules such as glycoproteins) in cells. Glycosylation is the most intensive and complex post-translational modification of proteins, and glycans are known to be involved in cell growth and development [51], in the immune system [52], in cell-to-cell communication [53], and in cancer and microbial diseases [54, 55].
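As a quick check on the proteome-size figures quoted in the proteomics paragraph above, the following minimal sketch computes the fold increases those figures imply. The counts are the approximate values given in the text (the protein count is a lower bound), and the constant and function names are illustrative, not taken from the paper.

```python
import math

# Approximate counts quoted in the text; PROTEINS is a lower bound.
GENES = 25_000
TRANSCRIPTS = 100_000
PROTEINS = 1_800_000


def fold_increase(larger: int, smaller: int) -> float:
    """Return how many times larger the first count is than the second."""
    return larger / smaller


# Average number of transcripts produced per gene.
transcripts_per_gene = fold_increase(TRANSCRIPTS, GENES)
# Average number of protein forms per transcript (lower bound).
proteins_per_transcript = fold_increase(PROTEINS, TRANSCRIPTS)
# Overall expansion from genes to protein forms (lower bound).
proteins_per_gene = fold_increase(PROTEINS, GENES)
# The same expansion expressed in orders of magnitude (base-10 log).
orders_of_magnitude = math.log10(proteins_per_gene)

print(f"transcripts per gene:    {transcripts_per_gene:.0f}")
print(f"proteins per transcript: {proteins_per_transcript:.0f}")
print(f"proteins per gene:       {proteins_per_gene:.0f}")
print(f"orders of magnitude:     {orders_of_magnitude:.2f}")
```

Under these figures, each gene yields on average about 4 transcripts and at least 72 distinct protein forms, which is just under two orders of magnitude.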
Fluxomics refers to a range of methods in experimental and computational biology that attempt to identify or predict the rates of metabolic reactions in biological systems [56].

Phenomics and exposomics.
doi:10.1002/pmic.201500396 pmid:26677817 fatcat:rwqiuxxgmffrppkz2ccj7ffm5m