Pattern discovery and cancer gene identification in integrated cancer genomic data

Qianxing Mo, Sijian Wang, Venkatraman E. Seshan, Adam B. Olshen, Nikolaus Schultz, Chris Sander, R. Scott Powers, Marc Ladanyi, Ronglai Shen
2013 Proceedings of the National Academy of Sciences of the United States of America  
Large-scale integrated cancer genome characterization efforts, including The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia, have created unprecedented opportunities to study cancer biology in the context of knowing the entire catalog of genetic alterations. A clinically important challenge is to discover cancer subtypes and their molecular drivers in a comprehensive genetic context. Curtis et al. [Nature (2012) 486(7403):346-352] have recently shown that integrative clustering of
copy number and gene expression in 2,000 breast tumors reveals novel subgroups beyond the classic expression subtypes that show distinct clinical outcomes. To extend the scope of integrative analysis to include somatic mutation data from massively parallel sequencing, we propose a framework for joint modeling of discrete and continuous variables that arise from integrated genomic, epigenomic, and transcriptomic profiling. The core idea is motivated by the hypothesis that diverse molecular phenotypes can be predicted by a set of orthogonal latent variables that represent distinct molecular drivers, and thus can reveal tumor subgroups of biological and clinical importance. Using the Cancer Cell Line Encyclopedia dataset, we demonstrate that our method can accurately group cell lines by their cell-of-origin for several cancer types, and precisely pinpoint their known and potential cancer driver genes. Our integrative analysis also demonstrates the power for revealing subgroups that are not lineage-dependent, but consist of different cancer types driven by a common genetic alteration. Application to The Cancer Genome Atlas colorectal cancer data reveals distinct integrated tumor subtypes, suggesting different genetic pathways in colon cancer progression.

multivariate generalized linear model | multidimensional data | penalized regression

A major goal of many cancer genome projects is to characterize key genetic alterations in cancer and discover therapeutic targets through comprehensive genomic profiling of the cancer genome. The Cancer Genome Atlas (TCGA) studies have unveiled the genetic landscape of several cancer types by whole-genome and whole-exome sequencing, DNA copy number profiling, promoter methylation profiling, and mRNA expression profiling in a large number of tumors (1-5).
Complementary to the tumor project, the Cancer Cell Line Encyclopedia (CCLE) (6) and the Sanger cell line project (7) have cataloged a compilation of genetic and molecular data in almost 1,000 human cancer cell lines, coupled with pharmacological profiles for a large panel of anticancer drugs. These large-scale integrative genomic efforts have been geared toward comprehensively cataloging individual genomic alterations, analogous to a reverse-engineering process where thousands of individual cancer genomes are taken apart to shed light on common biological principles. Unfortunately, cancer genomes exhibit considerable heterogeneity, with abnormalities occurring in different genes among different individuals, which makes it a great challenge to identify those genes with functional importance and therapeutic implications. Thus, there is a corresponding need for a forward-engineering process that synthesizes and integrates the information to extract biological principles from the massive amount of data and to provide useful insights for advancing diagnostic, prognostic, and therapeutic strategies.

In a previous publication (8), we proposed an integrative clustering framework called iCluster. The method was recently used in a landmark study to predict novel breast cancer subtypes with distinct clinical outcomes (9), and it was found that the joint clustering of copy number and gene expression profiles resolved the considerable heterogeneity of the expression-only subgroups. Other approaches to data integration that have emerged in recent years include generalized data decomposition methods (10, 11) and nonparametric Bayesian models (12). However, two major challenges have not yet been fully addressed. First, the existing methods are not designed to include both discrete (e.g., somatic mutation) and continuous variables, thus limiting the ability to harness the full potential of large-scale integrated genomic datasets.
In fact, most of the previous methods have focused on integrating only copy number and gene expression. A second challenge that has not been fully addressed lies in systematically distinguishing cancer genes that are reliable and constant features of a subtype from those that are less reliable.

To address these challenges, we present a significant enhancement of the iCluster method, which we call iCluster+. The enhanced method can perform pattern discovery that integrates diverse data types: binary (somatic mutation), categorical (copy number gain, normal, loss), and continuous (gene expression) values. In this paper, we demonstrate the power of this method for integrating the full spectrum of cancer genomic data using the CCLE and TCGA colorectal cancer datasets. A key aspect of the method is the use of generalized linear regression to formulate a joint model with respect to a common set of latent variables that we propose represent distinct driving factors (molecular etiology and genetic pathways). Geometrically, these latent variables form a set of "principal" coordinates that span a lower dimensional integrated subspace, and collectively capture the major biological variations observed across cancer genomes. As a result, the latent variable approach enables rigorous analysis of the integrated genomic data, which, as we show in this report, can reveal common themes that sort the tumors into distinct subgroups of biological and clinical importance. To identify genomic features that contribute most to the biological variation and thus have direct relevance for characterizing the molecular subgroups, we apply a penalized
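
The joint model described above, in which binary, categorical, and continuous data types all depend on one common set of latent variables through generalized linear models, can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the iCluster+ implementation: it generates data from such a model and performs no fitting or penalization. The function name `simulate_joint_data`, the coefficient arrays, the intercept of -1.0, and the noise scale of 0.5 are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_joint_data(n_samples=60, n_latent=2, n_features=5, seed=0):
    """Simulate three data types driven by one shared set of latent variables.

    Illustrative assumption only: all coefficients are random draws, and the
    link functions are the standard ones for each data type.
    """
    rng = np.random.default_rng(seed)

    # Shared latent variables: one row per sample, one column per latent factor.
    z = rng.standard_normal((n_samples, n_latent))

    # Continuous data (e.g., gene expression): linear model with Gaussian noise.
    beta_cont = rng.standard_normal((n_latent, n_features))
    expression = z @ beta_cont + 0.5 * rng.standard_normal((n_samples, n_features))

    # Binary data (e.g., somatic mutation): logistic model.
    beta_bin = rng.standard_normal((n_latent, n_features))
    p_mutation = 1.0 / (1.0 + np.exp(-(z @ beta_bin - 1.0)))
    mutation = rng.binomial(1, p_mutation)

    # Categorical data (copy number loss / normal / gain): multinomial logistic model.
    beta_cat = rng.standard_normal((3, n_latent, n_features))
    logits = np.einsum("nk,ckf->nfc", z, beta_cat)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Sample one category per entry by inverting the cumulative distribution.
    thresholds = np.cumsum(probs, axis=-1)[..., :-1]
    u = rng.random((n_samples, n_features, 1))
    copy_state = (u > thresholds).sum(axis=-1) - 1  # coded as -1, 0, 1

    return z, expression, mutation, copy_state


z, expression, mutation, copy_state = simulate_joint_data()
print(z.shape, expression.shape, mutation.shape, copy_state.shape)
```

In this sketch the three matrices share nothing except `z`, so any structure common to all of them has to come from the latent variables. That is the sense in which a common latent space can summarize variation across data types.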
doi:10.1073/pnas.1208949110 pmid:23431203 pmcid:PMC3600490 fatcat:vx73girquzardhalcomylnxxka