Optimal-Time Text Indexing in BWT-runs Bounded Space [article]

Travis Gagie, Gonzalo Navarro, Nicola Prezza
2017 arXiv pre-print
Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m + occ) within O(r log(n/r)) space, on a RAM machine of w = Ω(log n) bits. Within O(r log(n/r)) space, our index can also count in optimal time O(m). Raising the space to O(r w log_σ(n/r)), we support count and locate in O(m log(σ)/w) and O(m log(σ)/w + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). (...continues...)
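To make the measure r and the counting operation concrete, here is a minimal Python sketch. It is not the paper's data structure: the BWT is built naively by sorting rotations, and rank queries use a linear scan rather than the loglogarithmic-time structures mentioned in the abstract. The function names (`bwt`, `count_runs`, `count_occurrences`) and the `$` sentinel are assumptions made for this illustration; the sentinel is assumed to be absent from the text and smaller than every text symbol.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel (illustration only)."""
    s = text + sentinel
    # Sort rotation start positions by the rotation they begin.
    starts = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The BWT is the last symbol of each sorted rotation.
    return "".join(s[i - 1] for i in starts)


def count_runs(b):
    """r: the number of maximal equal-symbol runs in the BWT string b."""
    return sum(1 for i, c in enumerate(b) if i == 0 or c != b[i - 1])


def count_occurrences(b, pattern):
    """Count pattern occurrences by backward search over the BWT string b."""
    # C[c] = number of symbols in the text that are smaller than c.
    freq = Counter(b)
    C, total = {}, 0
    for c in sorted(freq):
        C[c] = total
        total += freq[c]
    lo, hi = 0, len(b)  # half-open range of sorted rotations
    for c in reversed(pattern):
        if c not in C:
            return 0
        # rank(c, i) is computed by a linear scan here; a real index
        # would answer it from a compact structure over the runs.
        lo = C[c] + b[:lo].count(c)
        hi = C[c] + b[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("banana")
    print(b, count_runs(b), count_occurrences(b, "ana"))
    # A highly repetitive text keeps r small however long it grows.
    print(count_runs(bwt("ab" * 50)))
```

On a repetitive input such as `"ab" * 50`, the BWT collapses into a handful of runs even though the text has 100 symbols, which is the intuition behind bounding index space in terms of r rather than n.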
arXiv:1705.10382v4 fatcat:hfrc7jgbffdotaiha672zfl5pe