Faster algorithms for RNA-folding using the Four-Russians method

Balaji Venkatachalam, Dan Gusfield, Yelena Frid
2014 Algorithms for Molecular Biology  
The secondary structure that maximizes the number of non-crossing matchings between complementary bases of an RNA sequence of length n can be computed in O(n^3) time using dynamic programming. The Four-Russians method is a technique that reduces the running time of certain dynamic programming algorithms by a multiplicative factor after a preprocessing step in which solutions to all smaller subproblems of a fixed size are exhaustively enumerated. Frid and Gusfield designed an O(n^3 / log n) algorithm for RNA folding using the Four-Russians technique. However, in their algorithm the preprocessing is interleaved with the algorithm computation. We simplify the algorithm and the analysis by doing the preprocessing once, prior to the algorithm computation. We call this the two-vector method. We also show variants where, instead of exhaustive preprocessing, we solve only the subproblems encountered in the main algorithm, once each, and memoize the results. We give a proof of correctness and explore the practical advantages over the earlier method. The Nussinov algorithm admits an O(n^2) time parallel algorithm. We show a parallel algorithm using the two-vector idea that improves the time bound to O(n^2 / log n). We have implemented the parallel algorithm on graphics processing units using the CUDA platform. We discuss the organization of the data structures to exploit coalesced memory access for fast running time. These ideas also help in improving the running time of the serial algorithms. For sequences of up to 6000 bases, the parallel algorithm takes only about 2 seconds, and the two-vector and memoized versions are faster than the Frid-Gusfield algorithm by a factor of 3 and faster than Nussinov by a factor of 20.

Introduction

Computational approaches to finding the secondary structure of RNA molecules are used extensively in bioinformatics applications. The classic dynamic programming (DP) algorithm proposed in the 1970s has been central to most structure prediction algorithms. While the objective of the original algorithm was to maximize the number of pairings between complementary bases, the dynamic programming approach has been used for other models and approaches, including minimizing the free energy of a structure. The DP algorithm runs in cubic time, and there have been many attempts at improving its running time. Here, we use the Four-Russians method to speed it up. The Four-Russians method, named after Arlazarov et al. [4], is a method to speed up certain dynamic programming algorithms.
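For reference, the cubic-time dynamic program for base-pair maximization mentioned above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `nussinov`, the restriction to Watson-Crick pairs, and the absence of a minimum loop length are assumptions made here for brevity.

```python
# Assumption: only Watson-Crick pairs count, and no minimum loop length is enforced.
WATSON_CRICK = frozenset({("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")})


def nussinov(seq, pairs=WATSON_CRICK):
    """Return the maximum number of non-crossing complementary base pairs."""
    n = len(seq)
    if n == 0:
        return 0
    # D[i][j] = best score for the subsequence seq[i..j], inclusive.
    D = [[0] * n for _ in range(n)]
    for span in range(1, n):              # fill by increasing interval length
        for i in range(n - span):
            j = i + span
            best = D[i + 1][j]            # case 1: base i stays unpaired
            for k in range(i + 1, j + 1): # case 2: base i pairs with base k
                if (seq[i], seq[k]) in pairs:
                    left = D[i + 1][k - 1] if k - 1 >= i + 1 else 0
                    right = D[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + left + right)
            D[i][j] = best
    return D[0][n - 1]


print(nussinov("GGGAAAUCC"))
```

There are O(n^2) cells and each one scans O(n) candidate partners k, which gives the O(n^3) bound. The inner scan over k is the part that a Four-Russians speedup shortens.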
In a typical Four-Russians algorithm, a preprocessing step exhaustively enumerates the subproblems and tabulates the results. In the main DP algorithm, instead of filling out or inspecting individual cells, the algorithm takes longer strides in the table. The computation for multiple cells is solved in constant time by utilizing ...

... 13], RNAstructure [20, 23]. Probabilistic methods include stochastic context-free grammars [10, 9]; maximum expected accuracy (MEA) methods, in which secondary structures are composed of pairs that have a maximal sum of pairing probabilities, e.g., Max-Expect [16], Pfold [15], and CONTRAfold [8], which maximize the posterior probabilities of base pairs; and Sfold [7] and CentroidFold [12], which maximize the centroid estimator. There are also other methods that use a combination of thermodynamic and statistical parameters [2], and methods that use training sets of known folds to determine their parameters, e.g., CONTRAfold [8], Simfold [3], and ContextFold [29].

In addition to the Four-Russians method, other methods to improve the running time include Valiant's max-plus matrix multiplication [1] and Zakov et al. [30]; and sparsification, where the branch points are pruned to get an improved time bound [26, 5].

CUDA, the programming platform for GPGPUs, has been used to solve many bioinformatics problems. Chang, Kimmer and Ouyang [6] show an implementation of the Nussinov algorithm on CUDA. Rizk et al. [24] describe an implementation of the Zuker and Stiegler method, which involves energy parameters. The methods are discussed in section 5.
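The tabulation step described at the start of this section can be illustrated with a small sketch. It is a simplified example under stated assumptions, not the authors' two-vector code: it assumes that consecutive entries of a row segment increase by 0 or 1 and that consecutive entries of a column segment decrease by 0 or 1, so each segment is captured by its first value plus a bit vector. The names `build_table`, `encode_up`, `encode_down`, and `block_max` and the block size `Q` are hypothetical.

```python
def build_table(q):
    """Preprocessing: for every pair of q-bit step vectors (h, v), store the
    maximum over all prefixes (including the empty one) of sum(h_t - v_t)."""
    size = 1 << q
    table = [[0] * size for _ in range(size)]
    for h in range(size):
        for v in range(size):
            best = running = 0
            for t in range(q):
                running += ((h >> t) & 1) - ((v >> t) & 1)
                best = max(best, running)
            table[h][v] = best
    return table


def encode_up(a):
    """Pack the 0/1 increases of a non-decreasing vector into an integer."""
    return sum((a[t + 1] - a[t]) << t for t in range(len(a) - 1))


def encode_down(b):
    """Pack the 0/1 decreases of a non-increasing vector into an integer."""
    return sum((b[t] - b[t + 1]) << t for t in range(len(b) - 1))


def block_max(a0, h, b0, v, table):
    """max over k of a[k] + b[k], answered with a single table lookup."""
    return a0 + b0 + table[h][v]


Q = 3                     # number of steps per block (hypothetical choice)
TABLE = build_table(Q)    # built once, before the main computation

a = [3, 3, 4, 5]          # row segment: steps of 0 or +1
b = [4, 3, 3, 2]          # column segment: steps of 0 or -1
result = block_max(a[0], encode_up(a), b[0], encode_down(b), TABLE)
print(result)             # equals max(a[k] + b[k] for k in range(4)), here 7
```

In this sketch the encodings are recomputed on each call for clarity. A real implementation would keep the encoded vectors alongside the DP table so that each block costs one lookup, which is where the saving over scanning every cell comes from.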
doi:10.1186/1748-7188-9-5 pmid:24602450 pmcid:PMC3996002 fatcat:e6fstiezhjenlo7ty6cnnbqhoi