Efficient Parallel and Out of Core Algorithms for Constructing Large Bi-directed de Bruijn Graphs [article]

Vamsi Kundeti, Sanguthevar Rajasekaran, Hieu Dinh
2010 arXiv   pre-print
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories -- based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient
more » ... hms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In Jackson et. al. ICPP-2008, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages. In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(n(n/B)/B(M/B)). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with that of Jackson et. al. ICPP-2008 reveals that our algorithm is faster. We also provide efficient algorithms for the bi-directed chain compaction problem.
arXiv:1003.1940v1 fatcat:tbq37yuvxfd4jh2bx2aim4xpla