Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Graham Cormode
2011 Foundations and Trends in Databases  
Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the tradeoffs between the different synopsis types.

1.3 Outline

Wavelets. The wavelet synopsis is conceptually close to the histogram summary. The central difference is that, whereas histograms primarily produce buckets that are subsets of the original data-attribute domain, wavelet representations transform the data and seek to represent the most significant features in a wavelet (i.e., "frequency") domain, and can capture combinations of high and low frequency information. The most widely discussed wavelet transformation is the Haar-wavelet transform (HWT), which can, in general, be constructed in time linear in the size of the underlying data array. Picking the B largest HWT coefficients results in a synopsis that provides the optimal L_2 (sum-squared) error for the reconstructed data. Extending from one-dimensional to multi-dimensional data, as with histograms, provides more definitional challenges. There are multiple plausible choices here, as well as algorithmic challenges in efficiently building the wavelet decomposition.

The core AQP task for wavelet summaries is to estimate the answer to range sums. More general SPJ (select, project, join) queries can also be directly applied on relation summaries, to generate a summary of the resulting relation. This is made possible through an appropriately defined AQP algebra that operates entirely in the domain of wavelet coefficients. Recent research into wavelet representations has focused on error guarantees beyond L_2. These include L_1 (sum of errors) or L_∞ (maximum error), as well as relative-error versions of these measures.
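The Haar transform and B-term thresholding described above can be illustrated with a minimal Python sketch. It assumes a one-dimensional array whose length is a power of two and uses the orthonormal (sqrt(2)-normalized) form of the transform, under which keeping the B largest-magnitude coefficients minimizes the sum-squared reconstruction error. All function names (`haar_transform`, `top_b_synopsis`, `range_sum`, and so on) and the sample data are illustrative assumptions, not taken from the monograph.

```python
import math


def haar_transform(data):
    """Orthonormal Haar transform; len(data) must be a power of two."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(x) for x in data]
    length = n
    # Each pass halves the active prefix, so total work is n + n/2 + ... = O(n).
    while length > 1:
        half = length // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2)
                    for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2)
                   for i in range(half)]
        coeffs[:length] = averages + details
        length = half
    return coeffs


def inverse_haar_transform(coeffs):
    """Invert haar_transform by undoing the passes from coarsest to finest."""
    n = len(coeffs)
    data = list(coeffs)
    length = 1
    while length < n:
        merged = []
        for a, d in zip(data[:length], data[length:2 * length]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        data[:2 * length] = merged
        length *= 2
    return data


def top_b_synopsis(coeffs, b):
    """Keep the b coefficients of largest absolute value as {index: value}."""
    keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                  reverse=True)[:b]
    return {i: coeffs[i] for i in keep}


def reconstruct(synopsis, n):
    """Approximate the original array, treating dropped coefficients as zero."""
    coeffs = [0.0] * n
    for i, c in synopsis.items():
        coeffs[i] = c
    return inverse_haar_transform(coeffs)


def range_sum(synopsis, n, lo, hi):
    """Estimate sum(data[lo..hi]) (inclusive) from the synopsis alone."""
    return sum(reconstruct(synopsis, n)[lo:hi + 1])


# Hypothetical example data: an 8-value frequency array summarized by 3 coefficients.
data = [2, 2, 0, 2, 3, 5, 4, 4]
synopsis = top_b_synopsis(haar_transform(data), b=3)
print(range_sum(synopsis, len(data), 2, 5))
```

For simplicity, `range_sum` rebuilds the whole approximate array before summing. Because each data value depends only on the coefficients along one root-to-leaf path of the Haar error tree, an implementation tuned for query time could answer a range sum from a logarithmic number of coefficients instead.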
A fundamental choice when optimizing for error metrics other than L_2 is whether to restrict the possible coefficient values to those arising under the basic wavelet transform, or to allow other (unrestricted) coefficient values, specifically chosen to reduce the target error metric. The construction of such (restricted or unrestricted) wavelet synopses optimized for non-L_2 error metrics is a challenging problem.

Sketches. Sketch techniques have undergone extensive development over the past few years. They are especially appropriate for streaming data, in which large quantities of data flow by and the sketch summary must ...

Mathematical Essentials of Sampling

Chebyshev Bounds. However, normality is never guaranteed. The authors have generally found that statisticians are more accepting of distributional assumptions than are computer scientists, who tend to be more conservative. If one eschews distributional assumptions, then distribution-free bounds can be used instead. These are looser, but more comforting. One common bound is due to Chebyshev's inequality, which implies that for an unbiased estimator Y, ...
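For reference, the textbook form of Chebyshev's inequality for an unbiased estimator can be written as follows. The symbols θ (the quantity being estimated), SE[Y] (the standard error of Y), k, and δ are notation assumed here for illustration and may differ from the monograph's own statement.

```latex
% Assumed notation: Y is unbiased for \theta, so E[Y] = \theta,
% and SE[Y] = \sqrt{\mathrm{Var}[Y]} is its standard error.
\[
  \Pr\bigl(\,|Y - \theta| \ge k \cdot \mathrm{SE}[Y]\,\bigr) \;\le\; \frac{1}{k^{2}}
  \qquad \text{for any } k > 0 .
\]
% Setting 1/k^2 = \delta gives a distribution-free confidence bound:
\[
  \Pr\Bigl(\,|Y - \theta| < \frac{\mathrm{SE}[Y]}{\sqrt{\delta}}\,\Bigr) \;\ge\; 1 - \delta .
\]
```

With δ = 0.05, the half-width is about 4.47 standard errors, compared with roughly 1.96 under a normality assumption, which illustrates the sense in which distribution-free bounds are looser.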
doi:10.1561/1900000004 fatcat:wk7razxkmzcv7fzczftlohblwa