Compressed pattern matching in DNA sequences

Lei Chen, Shiyong Lu, J. Ram
Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004.  
We propose derivative Boyer-Moore (d-BM), a new compressed pattern matching algorithm for DNA sequences. The algorithm is based on the Boyer-Moore method, one of the most popular string matching algorithms. In this approach, we compress both the DNA sequences and the patterns by using two bits to represent each of the characters A, T, C, and G. Experiments indicate that this compressed pattern matching algorithm searches for long DNA patterns (length > 50) more than 10 times faster than the exact match ... of the software package Agrep, which is known as the fastest pattern matching tool. Moreover, compressing DNA sequences by this method gives a guaranteed space saving of 75%, since each character takes two bits instead of eight. The enhanced speed of the algorithm is due in part to the increased efficiency of the Boyer-Moore method that results from an increase in alphabet size from 4 to 256: four 2-bit characters fit in one byte, so each byte of the compressed text is one of 256 symbols.
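The two-bit encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mapping of bases to bit patterns, the high-bits-first byte layout, and the names `CODE`, `pack`, and `unpack` are assumptions made here for illustration. The sketch covers only the compression step, not the d-BM search itself or how matches at non-byte-aligned offsets are handled.

```python
# Hypothetical 2-bit code: the abstract does not say which bits map to which base.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # inverse of CODE, indexed by the 2-bit value


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so the padding sits in the low bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data (n is needed to drop padding)."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n])


if __name__ == "__main__":
    seq = "ACGTAC"
    packed = pack(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack(packed, len(seq)) == seq)
```

Under these assumptions, a sequence whose length is a multiple of four occupies exactly one quarter of its one-byte-per-character size, which matches the 75% saving stated above, and every packed byte takes one of 4^4 = 256 values, which is the enlarged alphabet a byte-oriented Boyer-Moore search would see.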
doi:10.1109/csb.2004.1332418 dblp:conf/csb/ChenLR04 fatcat:747uhput2zfnvaiog3hapfxu7a