PIP: A database system for great and small expectations

Oliver Kennedy, Christoph Koch
2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)
Estimation via sampling out of highly selective join queries is well known to be problematic, most notably in online aggregation. Without goal-directed sampling strategies, samples falling outside of the selection constraints lower estimation efficiency at best, and cause inaccurate estimates at worst. This problem appears in general probabilistic database systems, where query processing is tightly coupled with sampling. By committing to a set of samples before evaluating the query, the engine wastes effort on samples that will be discarded, query processing that may need to be repeated, or unnecessarily large numbers of samples. We describe PIP, a general probabilistic database system that uses symbolic representations of probabilistic data to defer computation of expectations, moments, and other statistical measures until the expression to be measured is fully known. This approach is sufficiently general to admit both continuous and discrete distributions. Moreover, deferring sampling enables a broad range of goal-oriented sampling-based (as well as exact) integration techniques for computing expectations, allows the selection of the integration strategy most appropriate to the expression being measured, and can reduce the amount of sampling work required. We demonstrate the effectiveness of this approach by showing that even straightforward algorithms can make use of the added information. These algorithms have a profoundly positive impact on the efficiency and accuracy of expectation computations, particularly in the case of highly selective join queries.
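To illustrate the contrast the abstract draws, here is a minimal sketch, not PIP's implementation, of estimating a conditional expectation under a highly selective predicate. It assumes a single exponentially distributed variable and the predicate X > threshold; the function names and parameters are hypothetical and chosen only for this example. The first function commits to samples before applying the constraint, the second uses the known constraint to sample only inside the satisfying region.

```python
import math
import random


def naive_conditional_mean(rate, threshold, n, rng):
    """Draw n samples up front, then discard those that fail X > threshold."""
    kept = [x for x in (rng.expovariate(rate) for _ in range(n)) if x > threshold]
    # With a selective predicate, very few (possibly zero) samples survive.
    mean = sum(kept) / len(kept) if kept else None
    return mean, len(kept)


def constrained_conditional_mean(rate, threshold, n, rng):
    """Sample only inside the region X > threshold (illustrative assumption:
    the constraint is known before sampling, so the inverse survival
    function of Exponential(rate) can be restricted to that region)."""
    tail_mass = math.exp(-rate * threshold)  # P(X > threshold)
    total = 0.0
    for _ in range(n):
        v = 1.0 - rng.random()               # uniform on (0, 1]
        survival = tail_mass * v             # uniform on (0, P(X > threshold)]
        total += -math.log(survival) / rate  # inverse survival function
    return total / n, n
```

With rate 1 and threshold 5, only about exp(-5), or roughly 0.7%, of the naive samples satisfy the predicate, while every sample drawn by the constrained version contributes to the estimate.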
doi:10.1109/icde.2010.5447879 dblp:conf/icde/KennedyK10 fatcat:tc3qgxxw25auzirg5pu2h7rjqy