Recycling RNA-Seq Data to Identify Candidate Orphan Genes for Experimental Analysis [article]

Jing Li, Zebulun Arendsee, Urminder Singh, Eve Syrkin Wurtele
2019 bioRxiv pre-print
Motivation: Each organism contains genes with no protein homolog in other species ("orphan genes"). Some of these have arisen de novo from non-genic material, while others may be the result of ultra-rapid mutation of existing genes. The challenges of identifying orphan genes and predicting their functions are immense, and as a result their importance is under-appreciated. The yeast genome expresses thousands of transcripts, many of which contain translated ORFs, that are not annotated as
... s. Here, we apply computational approaches to recycle and re-evaluate massive raw public RNA-Seq data to identify those ORFs that are the best candidates to represent orphan genes.
Results: We created a pooled, aggregated RNA-Seq dataset from the raw reads and metadata of over 3,400 RNA-Seq samples from 172 studies in the NCBI Sequence Read Archive (SRA) database (Leinonen et al., 2011), and realigned these reads to a transcriptome consisting of the genes annotated in the Saccharomyces Genome Database (SGD; Cherry et al., 1998) and 29,354 unannotated ORFs of the Saccharomyces cerevisiae genome. Phylostratigraphic analysis of the predicted proteins from the 29,354 unannotated open reading frames (ORFs) in the S. cerevisiae genome inferred that 15,806 are orphans ("orphan-ORFs"), 11,942 are genus-specific, and 1,606 are more highly conserved. These RNA-Seq data reveal over 150 transcripts containing orphan-encoding ORFs with mean expression levels across all samples comparable to those of half of the annotated non-orphan genes. Most orphan-encoding ORFs are highly expressed only under limited conditions. We built a co-expression matrix from the transcription dataset, and optimized partitioning by Markov Chain Clustering (MCL). Based on GO enrichment analysis, the MCL clustering result is significantly different from random clusters, indicating that the clusters are biologically meaningful. Over 3,000 significant GO terms (p-value < 0.05) were found in the clusters, and many unannotated ORFs were found to be highly correlated (Pearson correlation coefficient, PCC > 0.8) with annotated genes. For example, cluster 112 is composed of seripauperin genes, and smORF247301 is correlated with YPL223C, with a Pearson correlation of 0.95. We provide the results of the optimized aggregate-data analysis in a tool that can be used for powerful statistical analysis and visualization of specific transcripts under user-selected conditions.
This approach maximizes the ability to view potential interactions across experimental perturbations, and provides a rich context for experimental biologists to formulate novel, experimentally testable hypotheses about the potential functions of as-yet unannotated transcripts.
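The co-expression step mentioned in the abstract (pairwise Pearson correlation, with PCC > 0.8 used to flag unannotated ORFs that track annotated genes) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `pearson` and `coexpressed_pairs`, the transcript labels, and the toy expression values are all hypothetical and chosen only for illustration. The subsequent MCL partitioning and GO enrichment steps are not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        # A constant profile has no defined correlation; treated as 0 here (assumption).
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def coexpressed_pairs(expression, threshold=0.8):
    """Return (transcript_a, transcript_b, pcc) for every pair with pcc > threshold."""
    pairs = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        if r > threshold:
            pairs.append((a, b, r))
    return pairs


# Toy data invented for illustration only (not values from the study).
# Each list holds one transcript's expression across five hypothetical samples.
toy_expression = {
    "annotated_gene_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "unannotated_orf_1": [2.1, 3.9, 6.2, 8.0, 9.8],
    "unannotated_orf_2": [5.0, 1.0, 4.0, 2.0, 3.0],
}

for a, b, r in coexpressed_pairs(toy_expression):
    print(f"{a}\t{b}\tPCC={r:.2f}")
```

With thousands of samples and tens of thousands of transcripts, an all-pairs loop like this would be too slow in practice; a vectorized correlation routine would be the more realistic choice, and the pure-Python version is used here only to keep the arithmetic visible.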
doi:10.1101/671263 fatcat:53hr5dri7bg3vjbd2jbv3cdoui