Compressing Multisets with Large Alphabets [article]

Daniel Severo, James Townsend, Ashish Khisti, Alireza Makhzani, Karen Ullrich
2021 arXiv pre-print
Current methods that optimally compress multisets are not suitable for high-dimensional symbols, as their compute time scales linearly with alphabet size. Compressing a multiset as an ordered sequence with off-the-shelf codecs is computationally more efficient, but has a sub-optimal compression rate, as bits are wasted encoding the order between symbols. We present a method that can recover those bits, assuming symbols are i.i.d., at the cost of an additional 𝒪(|ℳ| log M) in average time complexity, where |ℳ| and M are the total and unique number of symbols in the multiset. Our method is compatible with any prefix-free code. Experiments show that, when paired with efficient coders, our method can efficiently compress high-dimensional sources such as multisets of images and collections of JSON files.
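To make the "wasted bits" concrete: when symbols are i.i.d., every distinct ordering of a given multiset is equally likely, so encoding the order costs log2 of the number of distinct orderings, which is the multinomial coefficient |ℳ|! / (m_1! · ... · m_M!), where m_i is the count of the i-th unique symbol. The sketch below only computes that quantity. It is not the paper's compression method, and the function name `order_bits` is a hypothetical name chosen for illustration.

```python
from collections import Counter
from math import lgamma, log


def order_bits(symbols):
    """Bits spent on ordering when a multiset is stored as a sequence.

    Hypothetical helper for illustration only. It assumes all distinct
    orderings of the multiset are equally likely, which holds for
    i.i.d. symbols.
    """
    counts = Counter(symbols)  # multiplicity m_i of each unique symbol
    total = len(symbols)       # |M|, the total number of symbols
    # Natural log of total! / prod(m_i!), using lgamma(k + 1) = ln(k!)
    log_orderings = lgamma(total + 1) - sum(
        lgamma(m + 1) for m in counts.values()
    )
    return log_orderings / log(2)  # convert nats to bits


if __name__ == "__main__":
    # All symbols distinct: the saving is log2(|M|!)
    print(order_bits(["a", "b", "c"]))
    # Repeated symbols have fewer distinct orderings, so the saving shrinks
    print(order_bits(["a", "a", "b"]))
```

The saving is largest when all symbols are distinct and falls to zero when every symbol is identical, since a single repeated symbol has only one ordering.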
arXiv:2107.09202v1 fatcat:xpmxlyp2nfbkllsnrhcnbdvupa