Faster Fully Compressed Pattern Matching by Recompression

Artur Jeż
2015 ACM Transactions on Algorithms  
In this article, a fully compressed pattern matching problem is studied. The compression is represented by straight-line programs (SLPs), that is, context-free grammars generating exactly one string; the term fully means that both the pattern and the text are given in the compressed form. The problem is approached using a recently developed technique of local recompression: the SLPs are refactored so that substrings of the pattern and text are encoded in both SLPs in the same way. To this end, the SLPs are locally decompressed and then recompressed in a uniform way. This technique yields an O((n + m) log M) algorithm for compressed pattern matching, assuming that M fits in O(1) machine words, where n and m are the sizes of the compressed representations of the text and the pattern, respectively, and M is the size of the decompressed pattern. If only m + n fits in O(1) machine words, the running time increases to O((n + m) log M log(n + m)). The previous best algorithm, due to Lifshits, has O(n^2 m) running time.

... matching, equality testing, etc.) are known for various practically used compression methods (LZ77, LZW, their variants, etc.) [Gawrychowski ...].

The compression standards differ in the main idea as well as in details. Thus, when devising algorithms for compressed data, one needs to focus quite early on the exact compression method to which the algorithm is applied. The most practical (and challenging) choice is one of the widely used standards, such as LZW or LZ77. However, a different approach is also pursued: for some applications (and most theory-oriented considerations), it would be useful to model one of the practical compression standards by a more mathematically well-founded and "clean" method. This idea rests at the foundations of the notion of straight-line programs (SLPs), which are simply context-free grammars generating exactly one string. Other reasons for the popularity of SLPs are that they usually compress the input text well [Larsson and Moffat 1999; Nevill-Manning and Witten 1997] and that they are closely related to the LZ77 compression standard: each LZ77-compressed text can be converted into an equivalent SLP of size O(n log(N/n)) in O(n log(N/n)) time [Rytter 2003; Charikar et al. 2005] (where N is the size of the decompressed text), whereas each SLP can be converted to an equivalent LZ77-like representation of O(n) size in polynomial time.
Finally, a greedy grammar compression can be efficiently implemented and thus can be used as a preprocessing step for other compression methods, such as those based on the Burrows-Wheeler transform [Kärkkäinen et al. 2012].

Problem statement. In this article, we consider the fully compressed pattern matching problem (FCPM), in which we are given a text of length N and a pattern of length M, represented by SLPs of size n and m, respectively. We are to answer whether the pattern occurs in the text and give a compact representation of all such occurrences.
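To make the problem statement concrete, the following is a minimal sketch of an SLP and of a naive baseline for FCPM. It is not the recompression algorithm of the article: it simply decompresses both inputs and scans, so its running time depends on N and M rather than on n and m. The dictionary encoding of rules and the names `slp_length`, `slp_expand`, and `naive_fcpm` are assumptions made for illustration only.

```python
# Assumed encoding (illustrative): every nonterminal maps either to a single
# letter or to a pair of nonterminals, so the grammar generates exactly one string.

def slp_length(rules, start):
    """Length of the generated string, computed without decompressing it."""
    memo = {}

    def length(sym):
        if sym not in memo:
            rhs = rules[sym]
            memo[sym] = 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])
        return memo[sym]

    return length(start)


def slp_expand(rules, start):
    """Fully decompress the SLP; the output can be exponentially longer than the grammar."""
    memo = {}

    def expand(sym):
        if sym not in memo:
            rhs = rules[sym]
            memo[sym] = rhs if isinstance(rhs, str) else expand(rhs[0]) + expand(rhs[1])
        return memo[sym]

    return expand(start)


def naive_fcpm(text_rules, text_start, pattern_rules, pattern_start):
    """Baseline only: decompress both SLPs and list every occurrence explicitly."""
    text = slp_expand(text_rules, text_start)
    pattern = slp_expand(pattern_rules, pattern_start)
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)  # step by one so overlapping occurrences are kept
    return positions


# Text SLP generating the Fibonacci word "abaababa" (N = 8, n = 6 rules).
text_rules = {
    "T1": "b", "T2": "a",
    "T3": ("T2", "T1"), "T4": ("T3", "T2"),
    "T5": ("T4", "T3"), "T6": ("T5", "T4"),
}
# Pattern SLP generating "aba" (M = 3, m = 4 rules).
pattern_rules = {"P1": "a", "P2": "b", "P3": ("P1", "P2"), "P4": ("P3", "P1")}

print(slp_length(text_rules, "T6"))                       # 8
print(naive_fcpm(text_rules, "T6", pattern_rules, "P4"))  # [0, 3, 5]
```

Because N can be exponential in n, both the decompressed text and an explicit list of occurrences may be exponentially long, which is why the problem asks for a compact representation of the occurrences rather than a list like the one returned here.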
doi:10.1145/2631920 fatcat:pcdrs47ghzbdvn5eia4pb7eqdi