Conquering computational challenges of omics data and post-ENCODE paradigms
The past few years have witnessed an unprecedented development of high-throughput technologies, particularly in next-generation sequencing, that measure nearly every molecule of life and its modifications. As observed by the keynote speaker Lior Pachter (University of California, Berkeley, USA), these include SNPs, DNA methylation, open chromatin, all forms of RNAs, proteins, protein-DNA interactions and microRNA-mRNA interactions. Furthermore, high-throughput measurements have recently
... ed a new zenith with the publication of the ENCODE and TCGA projects, which have significantly increased the capacity of biology and computation to interrogate interactions across different types of molecules of life. In expanding our understanding of mechanisms in biological systems, especially in previously unstudied non-coding regions of DNA, these developments have also created new challenges for accurate and effective analysis as we model across multiple scales and ever more data. The 2013 ISMB-ECCB joint conference provided a platform where pioneering solutions to these problems were presented. Here, we highlight representative works in several areas of computational biology: new algorithms for analyzing high-throughput data, regulatory network modeling, translational bioinformatics and methodologies for post-ENCODE studies.
High-throughput methodology for big data
Novel foundational methods address theoretical or empirical challenges associated with deep-sequencing biotechnologies. For example, the annual trend for omics data storage and analysis follows a geometric curve that far outpaces the famous Moore's Law for computation, paradoxically yielding a net reduction in omics analytical capacity. Michael Baym (Harvard Medical School, USA) described improvements in sequencing capacity achieved by pioneering 'compressive genomics', which leverages a meta-alignment approach that does not require decompressing redundant consensus sequences and has accelerated search efficiency by an order of magnitude (100% positive predictive value, 99% recall, compared with BLAST). Andrew Smith (University of Southern California, USA) provided an accurate estimate of the maximum number of distinct reads that can be obtained from a DNA library, given the read frequency distribution in limited preliminary sequencing.
The estimate agreed well with deep-read sequencing distributions from human and chimpanzee samples and can be extended to related biotechnologies such as ChIP-seq. Yaron Orenstein (Tel-Aviv University, Israel) transformed the problem of designing double-stranded DNA probes for protein-binding arrays into a problem of sequence coverage that affords unbiased measurements and more comprehensive protein binding assessments.
Tools for efficient and accurate analysis of newly generated high-throughput data continue to be developed. Henry CM Leung (The University of Hong Kong, China) described a de novo RNA-seq assembler, IDBA-tran, designed to remove assembly paths of de Bruijn graphs associated with sequencing errors and to merge paths caused by polymorphisms. The algorithm improved both sensitivity (by more than 10%) and specificity (by more than 5%) on read data associated with poorly expressed isoforms.
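To make the de Bruijn graph idea above concrete, the following minimal Python sketch builds a graph from short reads and drops rarely seen k-mers before looking for branch points. It is an illustration only and is not the IDBA-tran algorithm: the function names, the toy reads and the fixed `min_count` cutoff are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(kmer_counts, min_count=1):
    """Build a de Bruijn graph whose nodes are (k-1)-mers.

    Each k-mer seen at least min_count times contributes one edge
    prefix -> suffix. Rarer k-mers are treated as likely sequencing
    errors and dropped (a deliberately crude, hypothetical filter).
    """
    graph = defaultdict(set)
    for kmer, count in kmer_counts.items():
        if count >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Return nodes with more than one outgoing edge, in sorted order."""
    return sorted(node for node, succs in graph.items() if len(succs) > 1)


if __name__ == "__main__":
    # Toy data: three identical reads plus one read with a single error.
    reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
    counts = count_kmers(reads, k=4)
    raw = build_de_bruijn(counts, min_count=1)
    pruned = build_de_bruijn(counts, min_count=2)
    print("branch points before pruning:", branching_nodes(raw))
    print("branch points after pruning:", branching_nodes(pruned))
```

In this toy input the erroneous read creates a spurious branch at node `CGT`, which disappears once k-mers seen only once are removed. A single global cutoff like this cannot tell sequencing errors apart from weakly expressed isoforms or true polymorphisms, which is the regime the talk highlights, so a practical assembler would need a more careful criterion.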