Bayesian nonparametric discovery of isoforms and individual specific quantification
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns promote molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur at tissue- and sample-specific frequencies. Here, we develop BIISQ, a Bayesian nonparametric model for the discovery of Isoforms and Individual Specific Quantification from RNA-seq data. BIISQ does not require known isoform reference sequences but instead estimates isoform composition directly, using an isoform catalog shared across samples. We develop a stochastic variational inference approach for efficient and robust posterior inference and demonstrate superior precision and recall, compared with state-of-the-art isoform reconstruction methods, on simulated short-read RNA-seq data and on short-read data simulated from PacBio long-read sequencing. BIISQ achieves the largest gains for longer isoforms (those with more exons) and for lowly expressed isoforms (over 500% more transcripts correctly inferred at low coverage in simulations). Finally, we estimate isoforms in the GEUVADIS RNA-seq data, identify genetic variants that regulate transcript ratios, and demonstrate variant enrichment in functional elements related to mRNA splicing regulation.
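To make the precision and recall comparison concrete, the following is a minimal sketch of how the two quantities could be computed for reconstructed isoforms. It assumes, for illustration only, that each isoform is represented as an ordered tuple of exon indices and that a prediction counts as correct only when it matches a reference isoform exactly. The function name `precision_recall` and the toy exon structures are hypothetical and are not taken from the evaluation described here.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted isoforms under exact exon-chain matching.

    Both arguments are iterables of isoforms, each isoform an ordered
    sequence of exon indices (an assumed, illustrative representation).
    """
    pred = {tuple(iso) for iso in predicted}
    true = {tuple(iso) for iso in truth}
    true_positives = len(pred & true)  # isoforms recovered exactly
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall


# Toy gene with four exons and three true isoforms (hypothetical data).
truth = [(1, 2, 3, 4), (1, 3, 4), (1, 2, 4)]
# A method reports two isoforms, one of which is correct.
predicted = [(1, 2, 3, 4), (1, 4)]

p, r = precision_recall(predicted, truth)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case one of the two reported isoforms is correct and one of the three true isoforms is recovered, so precision is 1/2 and recall is 1/3. Under exact matching, longer isoforms are harder to recover because every exon junction must be inferred correctly.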