Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities

Usman Roshan, Satish Chikkagoudar, Dennis R Livesay
2008 BMC Bioinformatics  
Identification of RNA homologs within large genomic stretches is difficult when pairwise sequence identity is low or when unalignable flanking residues are present. In both cases structure-sequence or profile/family-sequence alignment programs become difficult to apply because of unreliable RNA structures or family alignments. As such, local sequence alignment programs (i.e., SSEARCH and BLAST) are frequently used instead. We have recently demonstrated that maximal expected accuracy alignments using partition function match probabilities, as implemented in the Probalign program, are significantly better than contemporary methods on divergent heterogeneous length multiple protein sequence datasets, thus suggesting an affinity for local alignment.

Results: We create a pairwise RNA-genome alignment benchmark using RFAM families with published structure and average pairwise sequence identity up to 60%. Each genomic sequence of each dataset is at least 5K nucleotides long. Furthermore, to simulate common conditions when the exact ends of an ncRNA are unknown, each query RNA has 5' and 3' genomic flanks of size 50, 100, and 150 nucleotides. We subsequently compare the error of a slightly modified version of the Probalign program (adjusted for local alignment) to the commonly used local alignment programs HMMER (with single sequence profiles), SSEARCH, and NCBI BLAST, and the popular ClustalW global alignment program with zero terminal gap penalties. Parameters were optimized for each program on a fixed training subset of the benchmark. Probalign has the highest overall accuracy on the full benchmark. It leads by 10% accuracy over SSEARCH (the next best method) on 5 out of 22 families. On datasets restricted to a maximum of 30% sequence identity, Probalign's overall median error is 71.2% vs. 83.4% for SSEARCH. This difference has a Friedman rank test P-value of less than 0.05. Furthermore, on these datasets Probalign leads SSEARCH by at least 10% on five families; SSEARCH leads Probalign by the same margin on two of the fourteen families. We also demonstrate that the Probalign mean posterior probability, compared to the normalized SSEARCH Z-score, is a better discriminator of alignment quality. The Probalign mean posterior probability has a Receiver Operating Characteristic (ROC) area under the curve of 0.834, compared to 0.806 for the normalized SSEARCH Z-score.
The RNA-genome alignment benchmark, training benchmark, false positive datasets, and the modified Probalign program are available at http://cs.njit.edu/usman/RNAgenome.

Conclusions: We demonstrate, for the first time, that partition function match probabilities used for expected accuracy alignment, as done in the Probalign program, provide a statistically significant improvement over current approaches for identifying evolutionarily distant RNA sequences from larger genomic segments, even when the query RNA is surrounded by unalignable flanks.

Contact: usman@cs.njit.edu

Background

The importance of RNA within cellular machinery and regulation is well established (1,2). Consequently, a proper understanding of RNA structure and function is vital to a more complete understanding of cellular processes. It is conjectured that the human genome contains several thousand as yet undiscovered ncRNAs that play critical roles throughout the cell. Profile-sequence and structure-sequence methods, such as HMMER (3) and INFERNAL (4), are commonly used to identify RNA homologs within much larger genomic segments. However, the requirement of a reliable family alignment and/or structure diminishes the utility of these approaches. This is especially the case when searching for evolutionarily distant homologs or when the query RNA sequence is surrounded by unalignable flanking nucleotides. In fact, homologous sequences below 60% pairwise identity are generally too difficult for current methods (5). Simple pairwise alignment approaches are commonly used when sufficient family data are not available. The SSEARCH program (6), a popular implementation of the Smith-Waterman algorithm, is frequently used for finding RNA homologs in genomic sequences. Moreover, it is a common baseline against which new homology search methods are compared (7-10). The NCBI BLAST program (11), which is also a local alignment algorithm, is faster than SSEARCH but much less sensitive.
SSEARCH and BLAST both search for optimal local alignments, with BLAST sacrificing sensitivity for speed. In contrast, the maximal expected accuracy approach is based on suboptimal alignments. Here, sequences are aligned using posterior (match) probabilities within pairwise alignments. These probabilities can be computed using partition function dynamic programming matrices, introduced by Miyazawa (12) and later studied by others (13,14), or pairwise HMMs as done in ProbconsRNA (15). Partition function posterior probabilities are analogous to nucleotide-nucleotide frequency counts estimated from an ensemble of suboptimal alignments (see ref. (14) for more details). We recently implemented the partition function approach within the program

Programs Biomed, 85, 203-209.
9. Klein, R.J. and Eddy, S.R. (2003) RSEARCH: finding homologs of single structured RNA sequences.
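
As an illustration of the partition function posterior probabilities and the maximal expected accuracy alignment described in the Background text, here is a minimal Python sketch. It is not the Probalign implementation. It assumes a global alignment ensemble, a linear (not affine) gap penalty, and a toy match/mismatch score. The names `match_score`, `posterior_match_probs`, `mea_alignment`, `T`, `gap`, `match`, and `mismatch` are hypothetical choices made for this example only.

```python
import math


def match_score(a, b, match=1.0, mismatch=-1.0):
    """Toy nucleotide score (an assumption; real tools use tuned matrices)."""
    return match if a == b else mismatch


def posterior_match_probs(x, y, T=1.0, gap=1.0):
    """Posterior probability that x[i] is aligned to y[j], for all i, j.

    Each alignment is weighted by exp(score / T); the partition function is
    the sum of these weights over all global alignments (linear gap cost).
    """
    n, m = len(x), len(y)
    g = math.exp(-gap / T)  # Boltzmann weight of one gap column

    # fwd[i][j]: summed weight of all alignments of x[:i] with y[:j]
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                w = math.exp(match_score(x[i - 1], y[j - 1]) / T)
                total += fwd[i - 1][j - 1] * w
            if i > 0:
                total += fwd[i - 1][j] * g
            if j > 0:
                total += fwd[i][j - 1] * g
            fwd[i][j] = total

    # bwd[i][j]: summed weight of all alignments of x[i:] with y[j:]
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                w = math.exp(match_score(x[i], y[j]) / T)
                total += bwd[i + 1][j + 1] * w
            if i < n:
                total += bwd[i + 1][j] * g
            if j < m:
                total += bwd[i][j + 1] * g
            bwd[i][j] = total

    z = fwd[n][m]  # full partition function
    return [
        [
            fwd[i][j] * math.exp(match_score(x[i], y[j]) / T) * bwd[i + 1][j + 1] / z
            for j in range(m)
        ]
        for i in range(n)
    ]


def mea_alignment(post):
    """Alignment maximizing the sum of posterior probabilities of its matches."""
    n = len(post)
    m = len(post[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + post[i - 1][j - 1],
                best[i - 1][j],
                best[i][j - 1],
            )
    # Trace back to recover the matched (i, j) index pairs (0-based).
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, best[n][m]
```

For example, `post = posterior_match_probs("GGAUCC", "GAUC")` followed by `pairs, total = mea_alignment(post)` returns the matched index pairs and their summed posterior. Dividing `total` by `len(pairs)` gives a mean posterior probability over the aligned pairs; whether this matches the exact score used in the paper is an assumption. Because unmatched positions cost nothing in the second step, terminal gaps are free in this sketch. Plain floating-point weights are adequate only for short sequences; genome-length inputs would need log-space or scaled arithmetic.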
doi:10.1186/1471-2105-9-61 pmid:18226231 pmcid:PMC2248559 fatcat:f76xlgo5fvctti4ovnf2bpdvha