Fast Searching in Packed Strings [article]

Philip Bille
2010 arXiv   pre-print
Given strings P and Q the (exact) string matching problem is to find all positions of substrings in Q matching P. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time which is optimal if we can only read one character at the time. However, most strings are stored in a computer in a packed representation with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. In this paper we
more » ... udy the worst-case complexity of string matching on strings given in packed representation. Let m ≤ n be the lengths P and Q, respectively, and let σ denote the size of the alphabet. On a standard unit-cost word-RAM with logarithmic word size we present an algorithm using time O(n/_σ n + m + ). Here is the number of occurrences of P in Q. For m = o(n) this improves the O(n) bound of the Knuth-Morris-Pratt algorithm. Furthermore, if m = O(n/_σ n) our algorithm is optimal since any algorithm must spend at least Ω((n+m)σ/ n + ) = Ω(n/_σ n + ) time to read the input and report all occurrences. The result is obtained by a novel automaton construction based on the Knuth-Morris-Pratt algorithm combined with a new compact representation of subautomata allowing an optimal tabulation-based simulation.
arXiv:0907.3135v2 fatcat:nt26xd3nuvemxnb6qdkr6dip5a