Indexing for summary queries

Ke Yi, Lu Wang, Zhewei Wei
ACM Transactions on Database Systems, 2014
Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications, where users are interested not in the actual records but in some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this article, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, and various sketches, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with optimal or near-optimal query cost. We illustrate the efficiency and usefulness of our designs through extensive experiments and a system demonstration.

A classical way to support aggregation queries is to augment index structures such as B-trees and R-trees, where an internal node stores the aggregate for its subtree. There is a large body of work on spatial data structures; please refer to the survey by Agarwal and Erickson [1999] and the book by Samet [2006]. When the data space forms an array, the data cube [Gray et al. 1997] is a classical structure for answering aggregation queries. However, all the past research, whether in computational geometry or databases, has only considered queries that return simple aggregates like count, sum, max (min), distinct count [Tao et al. 2004], top-k [Afshani et al. 2011], and median [Jørgensen and Larsen 2011]. The problem of returning complex summaries has not been addressed.

There is also a vast literature on various summaries in both the database and algorithms communities, motivated by the fact that simple aggregates cannot well capture the data distribution. These summaries, depending on the context and community, are also called synopses, sketches, or compressed representations.
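As a toy illustration of the aggregate-index idea mentioned above (an internal node stores the aggregate, here a sum, for its subtree, so a range query touches O(log n) nodes instead of scanning the range), the following sketch uses a one-dimensional segment tree. All names are illustrative; this is not the paper's construction, which handles general multidimensional ranges and full summaries rather than a single sum.

```python
class SegmentTree:
    """Array-backed segment tree: node i's value is the sum of its subtree."""

    def __init__(self, data):
        self.n = len(data)
        self.tree = [0] * (2 * self.n)
        self.tree[self.n:] = data                   # leaves hold the raw values
        for i in range(self.n - 1, 0, -1):          # internal nodes hold subtree sums
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def range_sum(self, lo, hi):
        """Sum of data[lo:hi], visiting O(log n) nodes."""
        s = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:                              # lo is a right child: take it and move right
                s += self.tree[lo]
                lo += 1
            if hi & 1:                              # hi is a right child: step left and take it
                hi -= 1
                s += self.tree[hi]
            lo //= 2
            hi //= 2
        return s
```

The same skeleton supports any decomposable aggregate (max, count, and so on) by swapping the combining operator.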
However, all past research has focused on how to construct a summary over the entire dataset, either offline, in a streaming fashion, or over distributed data [Shrivastava et al. 2004; Agarwal et al. 2012]. The indexing problem has not been considered, where the focus is to intelligently compute and store auxiliary information in the index at precomputation time, so that a summary on a requested subset of the records in the database can be built quickly at query time. The problem of how to maintain a summary as the underlying data changes, namely under insertions and deletions of records or under the sliding-window semantics [Datar et al. 2002], has also been extensively studied. But this should not be confused with our dynamic index problem: the former maintains a single summary for the entire dynamic dataset, while the latter aims at maintaining a dynamic structure from which a summary for any queried subset can be extracted, which is more general. Of course, for the former there often exist small-space solutions, while for the indexing problem we cannot hope for sublinear space, as a query range may be small enough that the summary degenerates to the raw query results.

Next we review some of the most fundamental and most studied summaries in the literature. Let D be a bag of items, and let f_D(x) be the frequency of x in D.

Heavy Hitters. An (approximate) heavy hitters summary allows one to extract all frequent items approximately; that is, for a user-specified 0 < φ < 1, it returns all items x with f_D(x) > φ|D| and no items with f_D(x) < (φ − ε)|D|, while an item x with (φ − ε)|D| ≤ f_D(x) ≤ φ|D| may or may not be returned. A heavy hitters summary of size O(1/ε) can be constructed in one pass over D, using the MG algorithm [Misra and Gries 1982] or the SpaceSaving algorithm [Metwally et al. 2006].

Quantiles. The quantiles (a.k.a. the order statistics), which generalize the median, are important statistics about the data distribution.
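To make the heavy hitters discussion concrete, here is a minimal one-pass sketch of the MG algorithm with at most k − 1 counters, where k would be chosen as ⌈1/ε⌉; each stored count undercounts the true frequency by at most |D|/k. This is an illustrative sketch, not the paper's indexed variant.

```python
def misra_gries(stream, k):
    """One pass over `stream`, keeping at most k - 1 counters.

    Every item with true frequency > len(stream)/k survives in the
    returned dict; stored counts are underestimates by at most len(stream)/k.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

A second pass over the data (when available) can turn the surviving candidates into exact counts.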
Recall that the φ-quantile, for 0 < φ < 1, of a set D of items from a totally ordered universe is the item ranked at φ|D| in D (for convenience, for the quantile problem it is usually assumed that there are no duplicates in D). A quantile summary contains enough information so that for any 0 < φ < 1, an ε-approximate φ-quantile can be extracted; that is, the summary returns a φ′-quantile, where φ − ε ≤ φ′ ≤ φ + ε. A quantile summary has size O(1/ε) and can be easily computed by sorting D and then taking the items ranked at ε|D|, 2ε|D|, 3ε|D|, . . . , |D|. In the streaming model, where sorting is not possible, one can construct a quantile summary of the optimal O(1/ε) size with O((1/ε) log εN) working space, using the GK algorithm [Greenwald and Khanna 2001].

Sketches. Various sketches have been developed as a useful tool for summarizing massive data. In this article, we consider the two most widely used ones: the Count-Min sketch and the AMS sketch.
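The sort-based quantile summary construction just described admits a very short sketch: sort D and keep the items ranked at ε|D|, 2ε|D|, and so on. The function names are illustrative, and the query helper takes ε so it can map a requested φ to a rank within the summary.

```python
def build_quantile_summary(D, eps):
    """Sort D and keep the items ranked eps|D|, 2*eps|D|, ..., |D| (1-based)."""
    s = sorted(D)
    step = max(1, int(eps * len(D)))
    return s[step - 1::step]

def query_quantile(summary, phi, eps):
    """Return an item whose rank is within eps*|D| of phi*|D|.

    The i-th summary item (1-based) has rank about i*eps*|D|, so the
    phi-quantile sits near position phi/eps in the summary.
    """
    idx = min(len(summary) - 1, max(0, round(phi / eps) - 1))
    return summary[idx]
```

With ε = 0.1 and |D| = 100, the summary keeps ten items, and any φ-quantile query is answered with rank error at most about ten.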
doi:10.1145/2508702