Statistics in Biosciences
Biomedical research has been revolutionized in recent years by the rapid advancement of sequencing technologies. Unprecedented amounts and diverse types of data have been generated from different platforms. For example, the whole human genome, which has about 3 billion base pairs, can be sequenced at 30-fold coverage for under $3000, making it possible to completely identify and characterize the genetic variations carried by an individual and define all the somatic mutations in a tumor
... n. In comparison, RNA sequencing data provide a comprehensive characterization of the transcriptome of the cell population under study, including different isoforms and previously unannotated transcripts. Coupled with different molecular methods, sequencing can also be used to investigate interactions between proteins and DNA and between proteins and RNA, as well as methylation and chromatin modifications. These rich data, often called next generation sequencing (NGS) data, present both great opportunities and even greater computational and statistical challenges. Through the concerted efforts of many statisticians, much progress has been made in addressing the unique issues that arise in the analysis of NGS data. This special issue of SIBS highlights some recent developments in statistical methodology for the analysis and interpretation of NGS data.

DNA-seq: In the first paper of this special issue, Li and colleagues provide an overview of the identification of single nucleotide polymorphisms from NGS data, a key step in the analysis of DNA sequencing data. Because sequencing remains costly, Lee and Zhao investigate the use of DNA barcoding and pooling to balance statistical efficiency against study cost. Ionita-Laza and colleagues then discuss how to use the inferred variants in genetic association analysis of both population-based and family-based samples.