Compressing Multisets with Large Alphabets
[article]

2021
arXiv
pre-print

Current methods that optimally compress multisets are not suitable for high-dimensional symbols, as their compute time scales linearly with alphabet size. ... Compressing a multiset as an ordered sequence with off-the-shelf codecs is computationally more efficient, but has a sub-optimal compression rate, as bits are wasted encoding the order between symbols. ... Related work: To the best of our knowledge, there are no previous works that present a method which is both computationally feasible and rate-optimal for compressing multisets of i.i.d. symbols with large ...

arXiv:2107.09202v1
fatcat:xpmxlyp2nfbkllsnrhcnbdvupa
Compressing combinatorial objects
[article]

2016
arXiv
pre-print

However, there are many types of non-sequential data for which good compression techniques are still largely unexplored. ... Near-optimal compression methods are described for certain types of permutations, combinations and multisets; and the conditions for optimality are made explicit for each method. ... alphabet X. ...

arXiv:1601.03689v1
fatcat:wcscbo4wezal7bs2szxnro2wkq
Tight Bounds on Profile Redundancy and Distinguishability

2012
Neural Information Processing Systems
A sufficient statistic for all these properties is the data's profile, the multiset of the number of times each data element appears. ... In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. ... For sufficiently large k, this value even exceeds n itself, showing that general distributions over large alphabets cannot be compressed or learned at a uniform rate over all alphabet sizes, and as the ...

dblp:conf/nips/AcharyaDO12
fatcat:jn2z7ti44be4jp5vqv6eftfani
On Universal Coding of Unordered Data

2007
2007 Information Theory and Applications Workshop
This further implies that finite-alphabet memoryless multisets cannot be encoded universally with vanishing fractional redundancy. ... of finite-alphabet memoryless multisets. ... Countable Alphabets: The previous discussion has dealt with the entropy of finite-alphabet multisets, but what about countable alphabets? ...

doi:10.1109/ita.2007.4357578
fatcat:cdhuuhdlazewdozxmrwju526a4
Tight bounds for universal compression of large alphabets

2013
2013 IEEE International Symposium on Information Theory
., [1] [2] [3] [4] [5] [6] [7] and references therein, have considered universal compression of sources over large alphabets, often using patterns to avoid infinite redundancy. ... To address this fast increase in redundancy with the alphabet size, a new approach was proposed for compression and estimation over large alphabets. ... A natural method for compressing a sequence over a large alphabet is to compress its pattern as well as the dictionary that maps the order to the original symbols. ...

doi:10.1109/isit.2013.6620751
dblp:conf/isit/AcharyaDJOS13
fatcat:fpfkmqupn5gi5bqnsn76t5eopm
Compressed word problems for inverse monoids
[article]

2011
arXiv
pre-print

The compressed word problem for a finitely generated monoid M asks whether two given compressed words over the generators of M represent the same element of M. ... For string compression, straight-line programs, i.e., context-free grammars that generate a single string, are used in this paper. ... In [27], Margolis and Meakin presented a large class of finitely presented inverse monoids with decidable word problems. ...

arXiv:1106.1000v1
fatcat:kmys7kimafbqlm2morr2g3yfri
Benefiting from Disorder: Source Coding for Unordered Data
[article]

2007
arXiv
pre-print

In particular, lossless coding of n letters from a finite alphabet requires Theta(log n) bits and universal lossless coding requires n + o(n) bits for many countable alphabet sources. ... ACKNOWLEDGMENTS: The authors thank Alon Orlitsky for fruitful discussions; in particular, the results in Section IV-A were developed in collaboration with him. The authors also thank Sanjoy K. ... Large-Size Multiset Asymptotics 1) Multiset Mean Squared Error: Assume that the source alphabet X is a subset of the real numbers. ...

arXiv:0708.2310v1
fatcat:lth2kyrzqzdknpbxewhbum627q
Super-Linear Indices for Approximate Dictionary Searching
[chapter]

2012
Lecture Notes in Computer Science
These methods require huge indices whose sizes grow exponentially with respect to the maximum allowable number of errors k. ... One approach to compress the full neighborhood is to replace some characters with wildcards. Let us extend the alphabet with a wildcard pseudo-character ? that matches any alphabet character. ... This method is not efficient for large k and/or large alphabets, because the size of the full neighborhood is O(n^k |Σ|^k) (where n and |Σ| are the sizes of the pattern and the alphabet, respectively) [21 ...

doi:10.1007/978-3-642-32153-5_12
fatcat:52enei4um5dp5hd55lstjviywu
Compressing multisets using tries

2012
2012 IEEE Information Theory Workshop
We consider the problem of efficient and lossless representation of a multiset of m words drawn with repetition from a set of size 2^n. ... with the same words. ... CONCLUSION: We introduced an algorithm (AlgI) to compress multisets of binary words obtained using a Bernoulli 1/2 source. ...

doi:10.1109/itw.2012.6404756
dblp:conf/itw/GriponRSG12
fatcat:kcviahm3xbg5rkhfmo5od6gtba
Weisfeiler-Lehman Graph Kernels

2011
Journal of machine learning research
In this article, we propose a family of efficient kernels for large graphs with discrete node labels. ... Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. ... S. was funded by the DFG project "Kernels for Large, Labeled Graphs (LaLa)". ...

dblp:journals/jmlr/ShervashidzeSLMB11
fatcat:qj5wpmzbozh65pj6azzoeijumq
Minimax Trees in Linear Time with Applications
[chapter]

2009
Lecture Notes in Computer Science
Suppose we want to build a good prefix code with which to compress a file, but are given only a sample of its characters. ... We are still studying alphabetic minimax trees and have started studying minimax trees with unequal edge costs. ...

doi:10.1007/978-3-642-10217-2_28
fatcat:ljkp7az66zeztmfhwgbjpcwcsa
Codes in the Space of Multisets—Coding for Permutation Channels With Impairments

2018
IEEE Transactions on Information Theory
of symbols from a given finite alphabet. ... A general channel model is assumed in which the transmitted multisets are potentially impaired by insertions, deletions, substitutions, and erasures of symbols. ... As we have shown, the study of multiset codes over a fixed alphabet reduces to the study of codes in A_m lattices, at least in the large block-length limit. ...

doi:10.1109/tit.2017.2789292
fatcat:weas33cgczaejnaf4yeoyl2b6m
Optimal Prefix Free Codes with Partial Sorting

2019
Algorithms
s deferred data structure to partially sort a multiset according to the queries performed on it (known since 1988). ... the new analysis technique, such improvement is obtained by combining a new algorithm, inspired by van Leeuwen's algorithm to compute optimal prefix free codes from sorted weights (known since 1976), with ... of natural language texts, cited as an example of "large alphabet" application by Moffat [3], and studied by Moura et al ...

doi:10.3390/a13010012
fatcat:ibk6k7d6o5fbzc4xwhiianc7xa
Classification using pattern probability estimators

2010
2010 IEEE International Symposium on Information Theory
We motivate and propose LRT's based on pattern probability estimators that are known to achieve low redundancy for universal compression of large alphabet sources. ... We are primarily interested in situations where the alphabet of the underlying distributions is large compared to the training data available, which is indeed the case in most practical applications. ... In the context of universal compression, it was previously shown in [8] that patterns can be compressed with diminishing per symbol redundancy regardless of alphabet size of the underlying distribution ...

doi:10.1109/isit.2010.5513570
dblp:conf/isit/AcharyaDOPS10
fatcat:ykj64mo4pnhp3j2omzmdurpjve
Estimating multiple concurrent processes

2012
2012 IEEE International Symposium on Information Theory Proceedings
For Poisson processes, if any estimator approximates the parameter multiset to within distance with error probability δ, then PML approximates the multiset to within distance 2 with error probability at ... For both problems, it is sufficient to consider the observations' profile, the multiset of activity counts, regardless of their process identities. ... of large alphabet data sources. ...

doi:10.1109/isit.2012.6283551
dblp:conf/isit/AcharyaDJOP12
fatcat:cqezmcvlbbfzfllevbnltkk7y4
