Prediction of RNA secondary structure by maximizing pseudo-expected accuracy

Michiaki Hamada, Kengo Sato, Kiyoshi Asai
2010 BMC Bioinformatics  
Recent studies have revealed the importance of considering the entire distribution of possible secondary structures in RNA secondary structure prediction; therefore, new types of estimators, including the maximum expected accuracy (MEA) estimator, have been proposed. MEA-based estimators are designed to maximize the expected accuracy of the base-pairs and have achieved the highest level of accuracy. Those methods, however, do not give a single best prediction of the structure, but employ ... parameters to control the trade-off between sensitivity and positive predictive value (PPV). It is unclear what parameter value should be used, and even a well-trained default parameter value does not, in general, give the best result for each RNA sequence under popular accuracy measures.
Results: Instead of using the expected values of the popular accuracy measures for RNA secondary structure prediction, which are difficult to calculate, the pseudo-expected accuracy, which can easily be computed from base-pairing probabilities, is introduced. It is shown that the pseudo-expected accuracy is a good approximation in terms of sensitivity, PPV, MCC, or F-score. The pseudo-expected accuracy can be approximately maximized for each RNA sequence by stochastic sampling. It is also shown that secondary structures that are well balanced between sensitivity and PPV can be predicted with a small computational overhead by combining the pseudo-expected accuracy of MCC or F-score with the γ-centroid estimator.
Conclusions: This study gives not only a method for predicting a secondary structure that balances sensitivity and PPV, but also a general method for approximately maximizing the (pseudo-)expected accuracy with respect to various evaluation measures, including MCC and F-score.
The proposed methods are extendable to other situations. The pseudo-expected accuracy can be introduced for common secondary structure prediction from multiple alignments of RNA sequences, because several probability distributions over common secondary structures are available, for example, the RNAalifold model [34, 35] and the Pfold model [36]. Also, the γ-centroid estimator can be extended to common secondary structure prediction [10], and the pseudo-expected MCC/F-score combined with that estimator is useful for predicting a common secondary structure that balances SEN and PPV (see [37]). Recently, Lu et al.
[6] proposed relaxed versions of SEN, PPV and MCC, in which slippage of base-pairs is allowed when computing those measures. It is possible to design a γ-centroid-type estimator that fits those measures and also to introduce the pseudo-expected accuracy of those measures. Moreover, the methods used in this paper can be extended to more general types of estimation problems (cf. [17]) with various accuracy measures that are defined by using TP, TN, FP and FN (cf. [29]).
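The idea of scoring a candidate structure with a pseudo-expected accuracy can be sketched in a few lines of Python. The sketch assumes that the pseudo-expected value of a measure is obtained by plugging the expected TP, TN, FP and FN (computed from base-pairing probabilities) into the usual formula for that measure. All names here (`expected_counts`, `pseudo_expected`, `best_candidate`) and the toy probabilities are hypothetical, and the candidate structures are assumed to come from elsewhere, for instance from a γ-centroid estimator run with several γ values or from stochastic sampling. It is an illustration, not the authors' implementation.

```python
import math

def expected_counts(pairs, bpp, n):
    """Expected TP, TN, FP, FN of a predicted structure.

    pairs: set of predicted base-pairs (i, j) with i < j
    bpp:   dict mapping (i, j) to its base-pairing probability (assumed input)
    n:     sequence length
    """
    total_positions = n * (n - 1) // 2           # all possible (i, j), i < j
    tp = sum(bpp.get(pair, 0.0) for pair in pairs)
    fp = len(pairs) - tp                         # predicted pairs that are absent
    fn = sum(bpp.values()) - tp                  # expected pairs not predicted
    tn = total_positions - tp - fp - fn
    return tp, tn, fp, fn

def f_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def pseudo_expected(measure, pairs, bpp, n):
    """Apply an accuracy measure to the expected counts (the assumed definition)."""
    return measure(*expected_counts(pairs, bpp, n))

def best_candidate(candidates, bpp, n, measure=f_score):
    """Pick the candidate structure with the highest pseudo-expected accuracy."""
    return max(candidates, key=lambda pairs: pseudo_expected(measure, pairs, bpp, n))

# Toy example: a length-6 sequence with made-up base-pairing probabilities.
bpp = {(0, 5): 0.9, (1, 4): 0.6, (0, 4): 0.05}
candidates = [set(), {(0, 5)}, {(0, 5), (1, 4)}]
chosen = best_candidate(candidates, bpp, n=6, measure=f_score)
print(sorted(chosen))
```

Because the score depends only on base-pairing probabilities and the candidate itself, ranking a handful of candidates adds little cost once the probabilities are available, which is consistent with the small overhead described above.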
doi:10.1186/1471-2105-11-586 pmid:21118522 pmcid:PMC3003279 fatcat:s75xuh5ndjhgxo5buymwgsyhou