Revisiting bounded context block-sorting transformations

J. Shane Culpepper, Matthias Petri, Simon J. Puglisi
2011 Software, Practice & Experience  
The Burrows-Wheeler Transform (bwt) produces a permutation of a string X, denoted X*, by sorting the n cyclic rotations of X into full lexicographical order, and taking the last column of the resulting n × n matrix to be X*. The transformation is reversible in O(n) time. In this paper, we consider an alteration to the process, called k-bwt, where rotations are only sorted to a depth k. We propose new approaches to the forward and reverse transform, and show the methods are efficient in practice. More than a decade ago, two algorithms were independently discovered for reversing k-bwt, both of which run in O(nk) time. Two recent algorithms have lowered the bounds for the reverse transformation to O(n log k) and O(n) respectively. We examine the practical performance of these reversal algorithms. We find the original O(nk) approach is most efficient in practice, and investigate new approaches, aimed at further speeding reversal, which store precomputed context boundaries in the compressed file. By explicitly encoding the context boundaries, we present an O(n) reversal technique that is both efficient and effective. Finally, our study elucidates an inherently cache-friendly (and hitherto unobserved) behaviour in the reverse k-bwt, which could lead to new applications of the k-bwt transform. In contrast to previous empirical studies, we show the partial transform can be reversed significantly faster than the full transform, without significantly affecting compression effectiveness.

Most bwt-based compression systems fully sort the cyclic rotations of X, and nearly all current empirical studies assume a full sorting of rotations. However, a full sorting of the rotations is resource intensive. In independent work, Schindler [13] and Yokoo [14] described an alternative approach in which the n rotations are only partially sorted to a fixed prefix depth, k. We refer to this modified transform as k-bwt. By limiting the sort depth to k, sorting can be accomplished in O(nk) time using radix sort, and is very fast in practice. Moreover, Schindler reports nearly identical compression effectiveness to the full transform, even for small values of k. The algorithms developed by Schindler [13] were subsequently made available in the general purpose compression tool szip. However, the simplification of the forward k-bwt transform comes at a cost: the reverse transform becomes more expensive, at least in theory.
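The O(nk) bound for the partial sort can be illustrated with a least-significant-symbol-first radix sort: k stable bucket passes over the rotation start positions, one per prefix symbol. The following is only a minimal sketch, not the implementation used in szip or in this paper. The function name `k_sort_rotations` and the `$` terminator in the usage note are assumptions made for illustration.

```python
def k_sort_rotations(x, k):
    """Order the rotation start positions of x by their first k symbols.

    Hypothetical helper: runs k stable bucket passes, from prefix symbol
    k-1 down to prefix symbol 0 (LSD radix sort). Each pass touches every
    rotation once, so the work is O(nk) plus the cost of ordering the
    distinct symbols in each pass.
    """
    n = len(x)
    order = list(range(n))
    for depth in range(k - 1, -1, -1):
        buckets = {}
        for i in order:
            # Symbol at offset `depth` of rotation i, wrapping cyclically.
            buckets.setdefault(x[(i + depth) % n], []).append(i)
        # Concatenate buckets in symbol order; order inside a bucket is kept,
        # which is what makes each pass stable.
        order = [i for symbol in sorted(buckets) for i in buckets[symbol]]
    return order
```

Calling `k_sort_rotations("knicknack$", 2)` returns the rotation start positions ordered by their first two symbols; rotations that share a two-symbol prefix stay in their original text order because every pass is stable.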
Our contribution: First, we describe an efficient forward k-bwt algorithm based on induced sorting techniques from suffix array construction [15]. Our second contribution is a practical O(n)-time k-bwt reversal algorithm that explicitly stores context boundaries. Third, we provide the first thorough empirical analysis of state-of-the-art k-bwt algorithms for the forward and inverse transforms, compression effectiveness, and associated trade-offs. Lastly, we discover a previously undocumented locality of access property inherent to k-bwt algorithms, allowing fast transform reversal for small k.

BACKGROUND AND NOTATION

Let X = X[0..n] = X[0]X[1]...X[n] be a string (or text) of n + 1 symbols, where the first n symbols of X are drawn from an alphabet Σ and comprise the actual input; X[n] = $ is a unique "end-of-string" symbol that is defined to be lexicographically smaller than all symbols in Σ. The string X_i = X[i..n]X[0..i − 1] represents the ith rotation of X, or "rotation i" less formally. The substring X[i..n] is the ith suffix of X, or "suffix i". Rotation i is always prefixed with suffix i as a result of the unique end-of-string symbol $.

The k-bwt depends upon a partial sort of the rotations of X, based on an ordering of the prefixes of these rotations of length k ≥ 1. We refer to the partial ordering as a k-ordering of rotations into k-order, and to the process itself as a k-sort. If two or more rotations are equal under k-order, the rotations have the same k-rank and therefore fall into the same k-group. Throughout this paper we assume a k-sort is stable. This assumption guarantees that the ordering of rotations within each k-group coincides with their original ordering in X.

The Burrows-Wheeler Transform

String X is transformed into X* using the following technique [1]:
1. Form a matrix M whose rows are the cyclic rotations of X;
2. Sort the rows of M into lexicographical order and let F and L be the first and last columns of M respectively;
3. So, X* = L[0]L[1]...L[n].
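The three steps above, and the stable k-sort variant defined earlier, can be sketched directly from the definitions. This is a quadratic-space illustration only, not the induced-sorting algorithm the paper proposes. The names `rotations` and `bwt` are hypothetical, and the code assumes every input symbol compares greater than the character `$`, as lowercase letters do.

```python
def rotations(x):
    """Return all cyclic rotations of x; rotation i starts at x[i]."""
    return [x[i:] + x[:i] for i in range(len(x))]


def bwt(x, k=None):
    """Return (last column, I) for the rotations of x sorted to depth k.

    k=None compares whole rotations (the full transform). Otherwise only
    the first k symbols are compared. Python's sorted() is stable, so
    rotations in the same k-group keep their original order in x.
    I is the row of the sorted matrix that holds x itself.
    """
    # Assumption: x ends with a unique '$' smaller than every other symbol.
    assert x.endswith("$") and x.count("$") == 1
    rots = rotations(x)
    depth = len(x) if k is None else k
    order = sorted(range(len(x)), key=lambda i: rots[i][:depth])
    last_column = "".join(rots[i][-1] for i in order)
    return last_column, order.index(0)
```

For example, `bwt("knicknack$")` gives the full transform of the string together with its row index, while `bwt("knicknack$", 2)` gives the output when rotations are only sorted on their first two symbols. The two outputs are permutations of the same symbols and differ only where a 2-group contains rotations that a deeper sort would reorder.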
To reverse the transform we must also record position I, which corresponds to the row in M where the original string appears.

Let M_k refer to the matrix M of rotations with the rotations stably k-sorted. Therefore, the original matrix of rotations by order of starting positions with respect to X is M_0, and M_n corresponds to the fully sorted matrix of bwt. Let LF_k be the mapping of each symbol in L to its corresponding position in F for M_k. For clarity, we use LF_n to denote the true LF-mapping, that is, the LF-mapping from the fully sorted bwt. We use L_k and L_n in a similar way. Let X*_k be the last column in M_k. The output of bwt is X*, and X*_k represents the output of k-bwt. If row j of M_k contains rotation i then Pred(j) is the row containing rotation i − 1. Figure 1 shows the fully sorted matrix M_n (right) and the partially sorted matrix M_2 (left) for the string "knicknack$". Observe the rows of M_n in the figure - up to
doi:10.1002/spe.1112 fatcat:ildlxj5ejzfkpipdpfdtvqeruq