The Value of Multiple Read/Write Streams for Approximating Frequency Moments
Paul Beame, Trinh Huynh
2012
ACM Transactions on Computation Theory
Recently, an extension of the standard data stream model has been introduced in which an algorithm can create and manipulate multiple read/write streams in addition to its input data stream. As in the data stream model, the most important parameter for this model is the amount of internal memory used by such an algorithm. The other key parameters are the number of streams the algorithm uses and the number of passes it makes over these streams. We consider how the addition of these multiple read/write streams affects the ability of algorithms to approximate the frequency moments of the input stream. We show that any randomized read/write stream algorithm with a fixed number of streams and a sublogarithmic number of passes that produces a constant-factor approximation of the k-th frequency moment F_k of an input sequence of length at most N with elements from {1, …, N} requires space Ω(N^{1−4/k−δ}) for any δ > 0. For comparison, it is known that with a single read-only data stream there is a randomized constant-factor approximation for F_k using Õ(N^{1−2/k}) space, and that there is a deterministic algorithm computing F_k exactly using 3 read/write streams, O(log N) passes, and O(log N) space. Therefore, although the ability to manipulate multiple read/write streams can add substantial power to the data stream model, with a sublogarithmic number of passes this does not significantly improve the ability to approximate higher frequency moments efficiently. Our lower bounds also apply to (1+ε)-approximations of F_k for ε ≥ 1/N.

In their seminal paper, Alon, Matias, and Szegedy [AMS99] showed that algorithms with small space requirements can approximately determine the frequency moments of data streams. The k-th frequency moment, F_k, is the sum of the k-th powers of the frequencies with which elements occur in a data stream. F_1 is simply the length of the data stream; F_0 is the number of distinct elements in the stream; and if the stream represents keys of a database relation, then F_2 is the size of the self-join on that key (see the illustrative sketch below). The methods in [AMS99] also yielded efficient randomized algorithms for approximating F_∞^*, the largest frequency of any element in the stream. These results have been extended and improved to apply to many other problems, including approximating arbitrary join sizes, computing ℓ_p differences between data streams, and computations over sliding windows (see the surveys [Mut06, BBD+02]). The best one-pass algorithms for frequency moments approximate F_k within a (1+ε) factor on streams of length N using Õ(N^{1−2/k}) space [IW05, BGKS06].

Along with designing algorithms for approximating F_k, Alon, Matias, and Szegedy showed that their algorithms were not far from optimal in the one-pass model; in particular, they showed that F_k requires Ω(N^{1−5/k}) space to approximate by randomized one-pass algorithms. They derived their lower bounds by extending bounds [Raz92] on the randomized 2-party communication complexity of a promise version of the set disjointness problem from 2 to p players, where each of the p players has access to its own private portion of the input. (The model is known as the p-party number-in-hand communication game.) A series of papers [SS02, BYJKS04, CKS03] has improved the space lower bound to an essentially optimal Ω̃(N^{1−2/k}) by improving the lower bound for the promise disjointness problem for p-party randomized number-in-hand communication games; thus F_k for k > 2 requires polynomial space in the data stream model.

However, as Grohe and Schweikardt [GS05] observed, in many natural situations in which the data stream model has been studied, the computation also has access to auxiliary external memory for storing intermediate results. In this situation, the lower bounds for the data stream model no longer apply. This motivated Grohe and Schweikardt to introduce a model, termed the read/write streams model in [BJR07], to capture this additional capability. In the read/write streams model, in addition to the input data stream, the computation can manipulate multiple sequentially-accessed read/write streams.
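To make the definitions of the frequency moments concrete, here is a minimal Python sketch (not from the paper; the function names are hypothetical). frequency_moment computes F_k exactly with a dictionary of counters, deliberately using space linear in the number of distinct elements, which is precisely the cost streaming algorithms must avoid. ams_f2_estimate is a simplified illustration of the [AMS99]-style one-pass F_2 estimator; its salted hash is an assumed stand-in for the 4-wise independent ±1 random variables of the actual algorithm.

```python
import hashlib
import statistics
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over distinct elements i of f_i^k, where f_i is
    the number of occurrences of i in the stream."""
    freq = Counter(stream)          # space linear in #distinct elements
    if k == 0:
        return len(freq)            # F_0: number of distinct elements
    return sum(f ** k for f in freq.values())

def _sign(copy, x):
    # Pseudorandom +/-1 value for element x in sketch copy `copy`:
    # an illustrative stand-in for 4-wise independent hashing.
    h = hashlib.blake2b(f"{copy}:{x}".encode(), digest_size=1).digest()
    return 1 if h[0] % 2 == 0 else -1

def ams_f2_estimate(stream, copies=64):
    """One-pass F_2 estimator: each copy maintains Z = sum of sign(x)
    over the stream; E[Z^2] = F_2, so averaging Z^2 over independent
    copies reduces the variance."""
    zs = [0] * copies
    for x in stream:
        for c in range(copies):
            zs[c] += _sign(c, x)
    return statistics.mean(z * z for z in zs)

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(frequency_moment(stream, 0))  # 7: distinct elements
print(frequency_moment(stream, 1))  # 11: stream length
print(frequency_moment(stream, 2))  # 21: self-join size
print(ams_f2_estimate(stream))      # randomized estimate, close to 21
```

The exact computation needs one counter per distinct element, while the estimator keeps only `copies` running sums; this gap is exactly the kind of space savings that the Õ(N^{1−2/k})-space algorithms formalize for k > 2.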
As noted in [GS05], the read/write streams model is substantially more powerful than the ordinary data stream model: read/write stream algorithms can sort lists of size N with O(log N) passes and O(log N) space using 3 streams, and hence can compute any F_k exactly using only O(log N) passes, O(log N) space, and 3 streams (a sketch of the final counting pass appears below). Unfortunately, given the large values of N involved, Θ(log N) passes through the data is a very large cost. For sorting, lower bounds in [GS05, GHS06] show that such small-space read/write stream algorithms are not possible using fewer passes; moreover, [GHS06, BJR07] show lower bounds for the related problems of determining whether two sets are equal and of determining whether the input stream consists of distinct elements. However, these bounds say very little about the problem of approximating frequency moments, which has much less stringent requirements than the above problems. Can read/write stream algorithms approximate larger frequency moments more efficiently than single-pass algorithms can? It seems plausible that read/write stream algorithms might be able to compute F_k efficiently for larger k than is possible for data stream algorithms. We show that the ability to augment the data stream model with computations using multiple read/write streams does not produce significant additional efficiency in approximating frequency moments.
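To illustrate why sorting immediately yields exact frequency moments in small space, here is a hedged Python sketch (function name hypothetical, not from the paper): assuming the input has already been sorted, e.g., by the 3-stream, O(log N)-pass sort of [GS05], equal elements appear as contiguous runs, so one further sequential pass with a constant number of counters (O(log N) bits of memory) computes F_k exactly.

```python
def f_k_from_sorted(sorted_stream, k):
    """Compute F_k exactly from an already-sorted stream by summing
    run_length**k over the contiguous runs of equal elements."""
    total = 0
    run_length = 0
    prev = None
    for x in sorted_stream:
        if x == prev:
            run_length += 1               # still inside the current run
        else:
            if prev is not None:
                total += run_length ** k  # close out the previous run
            prev, run_length = x, 1
    if prev is not None:
        total += run_length ** k          # close out the final run
    return total

print(f_k_from_sorted(sorted([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]), 2))  # 21
```

The expensive step is the sort itself, which is where the Θ(log N) passes are spent; the paper's lower bound shows that with a sublogarithmic number of passes, even constant-factor approximation of F_k requires nearly linear space.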
doi:10.1145/2077336.2077339