### Taming big probability distributions

Ronitt Rubinfeld
2012, XRDS: Crossroads, The ACM Magazine for Students
These days, it seems that we are constantly bombarded by discussions of "big data" and our lack of tools for processing such vast quantities of information. An important class of big data is most naturally viewed as samples from a probability distribution over a very large domain. Such data occurs in almost every setting imaginable; examples include samples from financial transactions, seismic measurements, neurobiological data, sensor nets, and network traffic records. In many cases there is no explicit description of the distribution, just samples. Even so, in order to make effective use of such data, one must estimate natural parameters and understand basic properties of the underlying probability distribution.

Typical questions include: How many distinct elements have non-zero probability in the distribution? Is the distribution uniform, normal, or Zipfian? Is a joint distribution independent? What is the entropy of the distribution? All of these questions can be answered fairly well using classical techniques in a relatively straightforward manner. However, unless assumptions are made on the distribution, such as that it is Gaussian or has certain "smoothness" properties, such techniques use a number of samples that scales at least linearly with the size of the domain of the distribution. Unfortunately, the challenge of big data is that the domains of the distributions are immense.

The good news is that there has been exciting recent progress in the development of sublinear-sample algorithmic tools for such problems. In this article, we will describe two lines of results, the first on testing the similarity of distributions and the second on estimating the entropy of a distribution, which highlight the main new ideas that have led to this progress. We assume that all of our probability distributions are over a finite domain $D$ of size $n$, but (unless otherwise noted) we do not assume anything else about the distribution.

#### Closeness to another distribution

How can we tell whether two distributions are the same? There are many variants of this question that have been considered, but let's begin with a simpler question, motivated by the following: How many years of lottery results would it take for us to believe in its fairness? In our setting: given samples of a single distribution $p$, how many samples do we need to determine whether $p$ is the uniform distribution?
To properly formalize this problem, we need to allow some form of approximation, since $p$ could be arbitrarily close to uniform, though not exactly uniform, and no algorithm that takes finitely many samples would have enough information to detect this. We will use the property testing framework: what we ask of our testing algorithm is to "pass" distributions that are uniform and to "fail" distributions that are far from uniform. We next need to decide what we mean by "far". Many distance measures are in common use, but for this article we will use the $L_1$ distance, which for two probability distributions $p$ and $q$ is defined as:

$$\|p, q\|_1 = \sum_{x \in D} |p(x) - q(x)|.$$

For $0 < \epsilon < 1$, we say that $p$ and $q$ are $\epsilon$-close with respect to the $L_1$ distance if $\|p, q\|_1 \le \epsilon$. Denote by $U_D$ the uniform distribution on $D$. Then, on input parameter $0 < \epsilon < 1$, the goal of the testing algorithm will be to pass $p$ if it is uniform and to fail it if $\|p, U_D\|_1 \ge \epsilon$. If $p$ is in the middle (not uniform, but not far from uniform), then either "pass" or "fail" is an allowable, and not unreasonable, answer.

One natural way to solve this problem, which we will refer to as the "naive algorithm", is to take enough samples of $p$ so that one can get a good estimate of the probability $p(x)$ for each domain element $x$. It is easy to see that there are distributions for which such a scheme would require a number of samples at least linear in $|D| = n$. However, there is a much more efficient $O(\sqrt{n}/\epsilon^4)$ sample algorithm, based on an idea of Goldreich and Ron [GR00] (see also [Pan08] for a more recent algorithm which requires only $O(\sqrt{n}/\epsilon^2)$ samples). This algorithm does not attempt to learn any of the probabilities of specific domain elements according to the distribution $p$. Instead, the algorithm counts "collisions": the number of times that samples coincidentally fall on the same domain element. Slightly more specifically, for a set of $k$ samples $x_1, \ldots, x_k$, let $i, j \in [1..k]$ be two indices of samples.
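As a brief aside, the $L_1$ distance and the naive algorithm described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the article: the function names (`l1_distance`, `empirical`, `naive_distance_to_uniform`, `draw_uniform`) are hypothetical, the domain is assumed to be the integers $0, \ldots, n-1$, and distributions are assumed to be stored as dictionaries from domain elements to probabilities.

```python
import random
from collections import Counter


def l1_distance(p, q):
    """L1 distance: sum over the domain of |p(x) - q(x)|.

    p and q are dicts mapping domain elements to probabilities;
    elements missing from a dict are treated as having probability 0.
    """
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def uniform(n):
    """Uniform distribution on the assumed domain {0, ..., n-1}."""
    return {x: 1.0 / n for x in range(n)}


def empirical(samples):
    """Naive estimate of p(x): the fraction of samples equal to x."""
    m = len(samples)
    return {x: c / m for x, c in Counter(samples).items()}


def naive_distance_to_uniform(samples, n):
    """Naive algorithm: estimate every p(x), then compare to uniform."""
    return l1_distance(empirical(samples), uniform(n))


def draw_uniform(n, m, rng):
    """Draw m independent samples from the uniform distribution."""
    return [rng.randrange(n) for _ in range(m)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n = 1000
    for m in (100, 1000, 200000):
        est = naive_distance_to_uniform(draw_uniform(n, m, rng), n)
        print(f"n={n}, m={m}: estimated L1 distance to uniform = {est:.3f}")
```

With $m = n/10$ samples, at most $m$ domain elements are ever seen, so the naive estimate is at least $2(1 - m/n) = 1.8$ even though the samples come from an exactly uniform distribution; the estimate only becomes small once $m$ is well above $n$. The collision-based tester avoids this by never estimating individual probabilities. Returning to the $k$ samples $x_1, \ldots, x_k$ and the two indices $i$ and $j$: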
Then we say that i and j "collide" if they output the same domain element, i.e.,