Efficient indexing algorithms for approximate pattern matching in text

Matthias Petri, J. Shane Culpepper
Proceedings of the Seventeenth Australasian Document Computing Symposium (ADCS '12), 2012
Approximate pattern matching is an important computational problem with a wide variety of applications in Information Retrieval. Efficient solutions to approximate pattern matching can be applied to natural language keyword queries with spelling mistakes, OCR-scanned text incorporated into indexes, language model ranking algorithms based on term proximity, or DNA databases containing sequencing errors. In this paper, we present a novel approach to constructing text indexes capable of supporting approximate search queries. Our approach relies on a new variant of the Context Bound Burrows-Wheeler Transform (k-BWT), referred to as the Variable Depth Burrows-Wheeler Transform (v-BWT). First, we describe our new algorithm and show that it is reversible. Next, we show how to use the transform to support efficient text indexing and approximate pattern matching. Lastly, we empirically evaluate the use of the v-BWT for DNA and English text collections, and show a significant improvement in approximate search efficiency over more traditional q-gram-based approximate pattern matching algorithms.
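For readers unfamiliar with the context-bound transform that the v-BWT builds on, the following is a minimal sketch of a forward k-BWT, assuming the usual definition: rotations of the text are sorted by their first k symbols only, and ties keep their original text order. It is an illustration only, not the authors' v-BWT or their implementation; the function name `k_bwt` and the quadratic-space slicing are assumptions made for brevity.

```python
def k_bwt(text: str, k: int) -> str:
    """Sketch of a context-bound BWT (hypothetical helper, not from the paper).

    Rotations are ordered by their first k symbols; rotations that share
    the same k-symbol context stay in text order.
    """
    n = len(text)
    k = min(k, n)  # a context deeper than n adds no information
    doubled = text + text  # lets us slice a rotation prefix without wrap-around logic
    # sorted() is stable, so equal k-symbol contexts keep their text order.
    order = sorted(range(n), key=lambda i: doubled[i:i + k])
    # Output the symbol preceding each rotation (last column of the sorted matrix).
    return "".join(text[i - 1] for i in order)


if __name__ == "__main__":
    print(k_bwt("banana$", 1))  # abnn$aa
    print(k_bwt("banana$", 2))  # anbn$aa
    print(k_bwt("banana$", 7))  # annb$aa (same as the full rotation-sorted BWT)
```

With k equal to the text length the output matches the full BWT of the rotations, while smaller k trades sorting depth for cheaper construction. The abstract describes the v-BWT only as a variable-depth variant of this idea, so how the depth is chosen is not shown here.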
doi:10.1145/2407085.2407087 dblp:conf/adcs/PetriC12 fatcat:n5wzkfhvxrhwdmaxr7brmc4xxi