Streaming and Sublinear Approximation of Entropy and Information Distances [article]

Sudipto Guha, Andrew McGregor, Suresh Venkatasubramanian
2005 arXiv pre-print
In many problems in data mining and machine learning, data items that need to be clustered or classified are not points in a high-dimensional space, but are distributions (points on a high-dimensional simplex). For distributions, natural measures of distance are not the ℓ_p norms and variants, but information-theoretic measures like the Kullback-Leibler distance, the Hellinger distance, and others. Efficient estimation of these distances is a key component in algorithms for manipulating distributions. Thus, sublinear resource constraints, either in time (property testing) or space (streaming), are crucial. We start by resolving two open questions regarding property testing of distributions. Firstly, we show a tight bound for estimating bounded, symmetric f-divergences between distributions in a general property testing (sublinear time) framework (the so-called combined oracle model). This yields optimal algorithms for estimating such well-known distances as the Jensen-Shannon divergence and the Hellinger distance. Secondly, we close a (log n)/H gap between the upper and lower bounds for estimating the entropy H in this model. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. We also provide other results along the space/time/approximation tradeoff curve.
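For concreteness, below is a minimal Python sketch of the quantities the abstract refers to: the Hellinger and Jensen-Shannon distances between two distributions, and a naive sampling-based entropy estimate. The function names and the sampling baseline are illustrative assumptions, not the paper's algorithms; the paper's contribution is achieving such estimates under sublinear time or polylogarithmic space constraints, which this sketch does not attempt.

```python
# Illustrative sketch only: textbook formulas for the distances named in the
# abstract, plus a naive sampling baseline for entropy. None of this is the
# paper's sublinear-time or streaming algorithm.
import math
import random


def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hellinger(p, q):
    """Hellinger distance between distributions p and q."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: a bounded, symmetric smoothing of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sampled_entropy(stream, num_samples=1000):
    """Monte-Carlo estimate of the empirical entropy of a stream.

    NOT the paper's polylog-space algorithm: it stores full counts and is
    shown only to make the estimation target concrete. Averaging
    log(n / count(x)) over uniformly sampled positions is an unbiased
    estimator of the empirical entropy sum_x (c_x/n) log(n/c_x).
    """
    n = len(stream)
    counts = {}
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
    total = 0.0
    for _ in range(num_samples):
        x = random.choice(stream)  # uniform random stream position
        total += math.log(n / counts[x])
    return total / num_samples


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print("Hellinger:", hellinger(p, q))
    print("Jensen-Shannon:", jensen_shannon(p, q))
    print("Sampled entropy:", sampled_entropy(["a", "a", "b", "c"] * 100))
```

Note that the Jensen-Shannon divergence is always finite and symmetric, unlike raw KL, which is why bounded symmetric f-divergences are the natural class for the combined oracle model discussed above.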
arXiv:cs/0508122v2