Histogramming Data Streams with Fast Per-Item Processing [chapter]

Sudipto Guha, Piotr Indyk, S. Muthukrishnan, Martin J. Strauss
2002 Lecture Notes in Computer Science  
A vector $A$ of length $N$ can be approximately represented by a histogram $H$, by writing $[0, N)$ as the non-overlapping union of $B$ intervals $I_j$, assigning a value $b_j$ to $I_j$, and approximating $A_i$ by $H_i = b_j$ for $i \in I_j$. An optimal histogram representation $H_{opt}$ consists of the choices of $I_j$ and $b_j$ that minimize the sum-square error $\|A - H\|_2^2 = \sum_i |A_i - H_i|^2$. Numerous applications in statistics, signal processing and databases rely on histograms; typically $B$ is significantly smaller than $N$ and, hence, representing $A$ by $H$ yields substantial compression. We give a deterministic algorithm that approximates $H_{opt}$ and outputs a histogram $H$ such that $\|A - H\|_2^2 \le (1 + \epsilon) \|A - H_{opt}\|_2^2$. Our algorithm considers the data items $A_0, A_1, \ldots$ in order, i.e., in one pass, spends processing time $O(1)$ per item, uses total space $B (\log N \log \|A\| / \epsilon)^{O(1)}$, and determines the histogram in time $O(B (\log N \log \|A\| / \epsilon)^{O(1)})$. Our algorithm is eminently suitable to emerging applications where the signal is presented in a stream, the size of the signal is very large, and one must construct the histogram using significantly smaller space than the signal size. In particular, our algorithm is suited to high-performance needs where the per-item processing time must be minimized. Previous algorithms either used large space, i.e., $\Omega(N)$, or worked longer, i.e., $N \log^{\Omega(1)} N$ total time over the $N$ data items. Our algorithm is the first that simultaneously uses small space and runs fast, taking $O(1)$ worst-case time for per-item processing. In addition, our algorithm is quite simple.
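To make the histogram definition and the sum-square error concrete, here is a minimal Python sketch. It is not the streaming algorithm described in the abstract: it uses the classical offline dynamic program, which takes O(N^2 B) time and O(N B) space, only to show what the optimal histogram being approximated looks like. The function names `sse_of_histogram` and `optimal_histogram` are hypothetical and chosen for illustration.

```python
from itertools import accumulate


def sse_of_histogram(a, boundaries, values):
    """Sum-square error of a histogram whose bucket j covers
    indices boundaries[j] .. boundaries[j+1]-1 and has value values[j]."""
    err = 0.0
    for j, b in enumerate(values):
        for i in range(boundaries[j], boundaries[j + 1]):
            err += (a[i] - b) ** 2
    return err


def optimal_histogram(a, num_buckets):
    """Offline dynamic program (NOT the paper's one-pass method).
    Assumes 1 <= num_buckets <= len(a).
    Returns (error, boundaries, values)."""
    n = len(a)
    # Prefix sums of values and of squares: any bucket's error in O(1).
    s = [0.0] + list(accumulate(a))
    q = [0.0] + list(accumulate(x * x for x in a))

    def bucket_err(lo, hi):
        # Error of one bucket on [lo, hi) when its value is the mean.
        total = s[hi] - s[lo]
        return (q[hi] - q[lo]) - total * total / (hi - lo)

    inf = float("inf")
    # best[k][i]: least error covering a[0:i] with exactly k buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for i in range(k, n + 1):
            for lo in range(k - 1, i):
                cand = best[k - 1][lo] + bucket_err(lo, i)
                if cand < best[k][i]:
                    best[k][i], cut[k][i] = cand, lo

    # Walk the recorded cuts backwards to recover bucket boundaries.
    boundaries = [n]
    i = n
    for k in range(num_buckets, 0, -1):
        i = cut[k][i]
        boundaries.append(i)
    boundaries.reverse()
    values = [(s[hi] - s[lo]) / (hi - lo)
              for lo, hi in zip(boundaries, boundaries[1:])]
    return best[num_buckets][n], boundaries, values


if __name__ == "__main__":
    signal = [1, 1, 1, 5, 5, 5, 9, 9]
    print(optimal_histogram(signal, 3))
```

The sketch stores the whole signal and all prefix sums, so it needs space linear in N. That is exactly the cost the one-pass algorithm in the abstract is designed to avoid, at the price of a (1 + ε) approximation factor.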
doi:10.1007/3-540-45465-9_58 fatcat:33aezxdlafchpbcax6h6obez6m