DCA Using Suffix Arrays

Martin Fiala, Jan Holub
2008 Data Compression Conference (DCC), Proceedings  
DCA (Data Compression using Antidictionaries) is a lossless data compression method working on bit streams, presented by Crochemore et al. [1]. DCA takes advantage of words that do not occur as factors in the text, i.e., that are forbidden. Thanks to these forbidden words (antiwords), some symbols in the text can be predicted. First the input text (over the binary alphabet Σ = {0, 1}) is analyzed and all minimal forbidden words are found and stored in the antidictionary AD. Whenever we reach a word u such that ua ∈ AD, u ∈ Σ*, a ∈ Σ, the following symbol can be predicted as the complement of a and does not need to be stored. When compressing the file, the symbols that can be predicted are simply erased. Once the antidictionary is constructed, both compression and decompression are extremely fast: a simple transducer is used.

Suffix Array Usage

In [1] a suffix trie was used for the antidictionary construction, which is the most time- and space-consuming part of the method. One of the main problems of the suffix trie is its memory consumption: even for antiwords longer than 30 bits and small input files, the trie grows very fast and needs tens to hundreds of megabytes, and creating and traversing the whole trie is quite slow. We build the antidictionary using a suffix array in O(k · N log N) time, where k is the maximal antiword length. The suffix array and the LCP array built over the binary alphabet have 8 times as many entries as the input text has bytes, since there is one suffix per bit. Still, the memory requirements of the suffix array and LCP construction depend only on the length N of the input text, i.e., they are O(N), whereas the suffix trie can grow exponentially with the trie depth. We use the Manzini-Ferragina suffix array construction [3]. A naive reference construction of AD is sketched below; the suffix-array algorithm computes the same set of minimal antiwords.

Dynamic Compression Scheme

With the dynamic approach we read the text only once: we compress the input and modify AD at the same time. Whenever we read some input, we recompute AD and use it for compressing the following input. Every time we read a symbol that violates the current AD, i.e., completes a forbidden word, we have to handle this exception. In this approach we do not have to encode AD separately, perform self-compression, or even prune it; we simply use all antiwords found so far. Memory requirements are smaller and the method is quite fast, as it needs neither the breadth-first search for building AD nor any other tree traversal for computing gains. It is also very simple to implement, since we do not have to fight the memory greediness of the suffix trie and we do not read the text twice as in the static scheme. On the other hand, there are some disadvantages as well: decompression is slower, parallel compressors/decompressors cannot be used, and we lose the k-local property. The second sketch below illustrates the scheme. We have implemented the static and dynamic methods with RLE (run-length encoding) as well as with almost antiwords.

[Results table: columns file, original, gzip, bzip2, almostaw-30, dyn.-rle-32, rle-34, DI [2]; row shown for file news; values missing in the source.]
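To make the prediction rule concrete, the following minimal Python sketch (our own illustration, not the transducer-based implementation described above) compresses and decompresses a bit string against a fixed antidictionary. The toy antidictionary, the string representation of bits, and the linear scan over AD in predict are simplifying assumptions.

```python
def predict(history, ad):
    # If history ends with u for some antiword ua in AD, the next bit
    # of the text cannot be a, so it must be the complement of a.
    # sorted() makes the scan order deterministic, keeping the
    # compressor and the decompressor in sync.
    for w in sorted(ad):
        u, a = w[:-1], w[-1]
        if history.endswith(u):
            return '0' if a == '1' else '1'
    return None

def compress(bits, ad):
    stored, history = [], ''
    for b in bits:
        if predict(history, ad) is None:
            stored.append(b)          # unpredictable bit: store it
        history += b                  # predictable bits are simply erased
    return ''.join(stored)

def decompress(stored, n, ad):
    # n is the original length, transmitted alongside the compressed data
    # so the decoder knows when to stop.
    out, i = '', 0
    while len(out) < n:
        p = predict(out, ad)
        if p is None:
            p, i = stored[i], i + 1   # consume one stored bit
        out += p
    return out

ad = {'11'}                           # toy antidictionary: '11' never occurs
text = '0101001010'
code = compress(text, ad)             # every bit after a '1' is erased -> '011011'
assert decompress(code, len(text), ad) == text
```

The real method drives the same predictions with a transducer built from AD instead of rescanning the antidictionary at every position.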
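The suffix-array construction itself is beyond the scope of this abstract, but the set it computes is easy to characterize: w is a minimal forbidden word iff w does not occur in the text while both w[:-1] and w[1:] do. The sketch below implements this characterization naively (a behavioral reference only, using an O(k · N)-space factor set; the paper's point is to compute the same set via the suffix array and LCP array instead).

```python
def antidictionary(text, k):
    # A word w is a minimal forbidden word iff w is not a factor of text
    # while both w[:-1] and w[1:] are.  Enumerate candidates w = u + b
    # over all factors u shorter than k, so w[:-1] is a factor by construction.
    factors = {''}                        # the empty word is always a factor
    for m in range(1, k + 1):
        for i in range(len(text) - m + 1):
            factors.add(text[i:i + m])
    ad = set()
    for u in factors:
        if len(u) >= k:
            continue
        for b in '01':
            w = u + b
            if w not in factors and w[1:] in factors:
                ad.add(w)
    return ad

print(sorted(antidictionary('0101001010', 4)))   # -> ['000', '11']
```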
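One possible shape of the dynamic scheme, reusing predict and antidictionary from the sketches above: both sides recompute AD from the shared history after every bit, so their predictions stay synchronized, and exceptions are recorded as a side list of positions. The side list and the from-scratch recomputation are our simplifications; the abstract does not specify the exception encoding, and a practical implementation updates AD incrementally.

```python
def dyn_compress(bits, k):
    stored, exceptions, history, ad = [], [], '', set()
    for t, b in enumerate(bits):
        p = predict(history, ad)          # prediction under the current AD
        if p is None:
            stored.append(b)              # unpredictable: store the bit
        elif p != b:
            exceptions.append(t)          # b completes a current antiword
        history += b
        ad = antidictionary(history, k)   # recompute AD from scratch (naive)
    return ''.join(stored), exceptions

def dyn_decompress(stored, exceptions, n, k):
    out, i, ad, exc = '', 0, set(), set(exceptions)
    while len(out) < n:
        p = predict(out, ad)
        if p is None:
            p, i = stored[i], i + 1       # consume one stored bit
        elif len(out) in exc:
            p = '0' if p == '1' else '1'  # exception: flip the wrong prediction
        out += p
        ad = antidictionary(out, k)
    return out

text = '0101001010'
code, exc = dyn_compress(text, 4)
assert dyn_decompress(code, exc, len(text), 4) == text
```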
doi:10.1109/dcc.2008.95 dblp:conf/dcc/FialaH08