Tracking down noncoding RNAs

V. Moulton
2005 Proceedings of the National Academy of Sciences of the United States of America  
Until relatively recently, RNA has taken a predominantly backstage role compared to protein in genome studies. However, this is changing dramatically with the discovery of a plethora of RNAs that do not act as messenger (mRNA), transfer (tRNA), or ribosomal (rRNA) RNAs (1-3). These noncoding RNAs (ncRNAs) play a role in a variety of processes such as transcriptional regulation, chromosome replication, RNA processing and modification, and protein degradation and translocation. Even so, ncRNAs usually lack the statistical signals in their primary sequence (like ORFs and codon bias) that have been used to such great effect in the identification of novel protein-encoding genes, making the task of systematically identifying new ncRNAs in genomes currently one of the most exciting challenges in computational biology.

The work of Washietl et al. in this issue of PNAS (4) faces this challenge head on. Through an elegant use of structural properties of RNA, the authors present an efficient comparative genomics approach to identifying novel ncRNAs and related genomic elements that promises to significantly contribute to the burgeoning field of computational RNomics.

Predicting RNA Structure

As with other computational approaches to identifying ncRNAs, the method of Washietl et al. (4) relies on structural properties of RNA. Unlike double-stranded DNA, an RNA molecule is composed of a single-stranded chain or sequence of nucleotides. As a consequence, parts of the molecule can base-pair with other complementary parts of the molecule, so that the nucleotide sequence plays a vital role in how the molecule folds. For this reason, it is possible to develop computational methods for predicting structural properties of an RNA molecule based on knowledge of its primary sequence.

As with proteins, the problem of predicting the three-dimensional structure of an RNA molecule directly from its primary sequence is still beyond current computational methods. However, the three-dimensional structure of an RNA molecule often builds on a simpler scaffold known as its secondary structure. This structure consists essentially of nested base-pairings, which makes it well suited to computational prediction. Moreover, secondary structure is commonly preserved under evolution (even when primary sequence is not), suggesting relevance to RNA function.
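Why nesting makes secondary structure tractable can be illustrated with a small dynamic program that counts the largest set of nested base pairs a sequence can form. This is only a minimal sketch of base-pair maximization, not the implementation of any published tool: the function names, the set of allowed pairs (Watson-Crick plus G-U wobble), and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) are assumptions made for illustration.

```python
# Illustrative sketch: maximum number of nested base pairs by dynamic programming.
# Assumptions (not from the article): allowed pairs, min_loop default, all names.

ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Return True if nucleotides a and b may form a base pair."""
    return (a, b) in ALLOWED_PAIRS


def max_nested_pairs(seq, min_loop=3):
    """Return the largest number of nested base pairs that seq can form."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k; nesting splits the problem
            # into an independent left part and an independent enclosed part.
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_nested_pairs("GGGAAACCC")` finds the three stacked G-C pairs of a small hairpin. Because pairs never cross, each pair splits the sequence into two independent subproblems, which is what keeps the computation polynomial (cubic in sequence length in this sketch).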
One of the first efficient algorithms for predicting secondary structure for an RNA sequence used dynamic programming to compute a maximum set of nested base-pairings (5). A more sophisticated extension of this algorithm soon followed (6), which incorporated more detailed secondary structure information. Basically, it used thermodynamic considerations to compute a secondary structure with minimum free energy for an RNA sequence. Although the method has been substantially developed since its introduction, and even greatly extended for the prediction of probably more realistic ensembles of secondary structures (7, 8), the underlying algorithm still lies in essence at the heart of many present-day RNA secondary structure prediction tools. However, such tools use primary sequence alone, so they tend not to perform as well as one might hope, commonly predicting only 50-70% of base pairs correctly on average (9).

Comparative Sequence Analysis

Because secondary structure is often preserved between homologous RNAs, comparative sequence analysis can provide a powerful alternative for its prediction. One of the earliest methods based on comparative analysis used mutual information to detect covarying columns in an alignment of RNA sequences (10). Related, but much more sophisticated, covariance models (11), the RNA analogue of hidden Markov models, were subsequently developed and successfully used in genomic searches for ncRNAs and are now available as part of the recently established Rfam database for RNA families (12).

Covariance models are family-specific and, as such, do not provide a generic tool for finding novel ncRNAs. However, the preservation of RNA secondary structure in an alignment naturally suggests a comparative genomics approach to finding ncRNAs: form alignments between conserved subsequences of genomes and then, by using secondary structure detection approaches, try to decide which of these are alignments of ncRNAs.
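The covariation signal that comparative methods look for can be illustrated with the mutual information between two alignment columns. The sketch below is a minimal illustration under stated assumptions, not the procedure of any cited method: the function name is hypothetical, sequences are assumed to be pre-aligned strings of equal length, and gap characters are treated as ordinary symbols.

```python
# Illustrative sketch: mutual information (in bits) between two alignment columns.
# Assumptions (not from the article): equal-length aligned strings, gaps counted
# as ordinary symbols, plain observed frequencies with no pseudocounts.
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Return the mutual information between columns i and j of an alignment."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 4 always form a complementary pair,
# while the middle columns are invariant.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
paired_score = column_mutual_information(toy_alignment, 0, 4)
invariant_score = column_mutual_information(toy_alignment, 1, 2)
```

In the toy alignment, columns that change together to preserve a base pair score high, whereas perfectly conserved columns score zero because they carry no covariation signal, even though they may still be paired.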
One of the first programs to employ this strategy was QRNA (13), which used probabilistic models to search for covariation in pairwise alignments and has been used to identify novel ncRNAs in bacteria and yeast. More recent methods include DDBRNA (14) and MSARI (15), which look for statistically significant covariation in multiple sequence alignments.

Picking Up the Signal

The method of Washietl et al. (4) employs a similar strategy. Le et al. (16) proposed that ncRNAs are more thermodynamically stable than is expected by chance. There has been much debate over this hypothesis, and the current consensus is that it does not hold in general. Even so, recent findings indicate that certain families of ncRNAs are, in fact, more stable than is expected by chance (most notably microRNA precursors; ref. 17), and Washietl et al. demonstrate that stability can, at the very least, be used as a diagnostic feature for detecting ncRNAs. In particular, they associate two scores with an alignment: the z score, a measure of thermodynamic stability, and the structure conservation index (SCI), a measure of evolutionary conservation. The z score is quite well known in the RNA computational biology community. However, the SCI is new. It is computed by comparing the minimum free energies of the sequences in an alignment with a "consensus energy," which is computed by incorporating covariation terms into a free energy minimization computation (18). Subsequently, a support vector machine is used to classify alignments as "functional" or "other" in the SCI/z score plane. This approach has the advantage of not requiring costly sampling of shuffled sequences or alignments, and the results obtained on
doi:10.1073/pnas.0500129102 pmid:15703286 pmcid:PMC549017 fatcat:hmhdljjxxjhyrmf2lalst2lx3i