Diamond in the rough

Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava
2004. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD '04)
Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important for summarizing, visualizing, and analyzing it. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions ("heavy hitters"). Such Hierarchical Heavy Hitters (HHHs) were previously [...] as a crucial aggregation technique in one dimension. In order to analyze the wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions. We identify and study two variants of HHHs for multi-dimensional data, namely the "overlap" and "split" cases, depending on how an aggregate computed for a child node in the multi-dimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. We show experimentally, using real and synthetic data, that the outputs of our proposed online algorithms are very similar (virtually identical, in many cases) to those of their offline counterparts. Our online algorithms crucially exploit the lattice property of the product of hierarchical dimensions (the "diamond") to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
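The difference between the "overlap" and "split" cases can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes that "overlap" means every parent receives the child's full count and that "split" means the count is divided evenly among the parents; both readings are assumptions that the abstract does not spell out. The names `parents` and `propagate` and the tuple-based node encoding are hypothetical.

```python
from collections import defaultdict

# A node is a tuple with one path per dimension; each path is a tuple of
# labels from the root of that hierarchy down to the node, e.g.
# (("a", "b"), ("x", "y")). An empty path () is the root of its hierarchy.

def parents(node):
    """Return the nodes reached by generalizing exactly one dimension by one level."""
    result = []
    for i, path in enumerate(node):
        if path:  # a dimension already at its root cannot be generalized further
            result.append(node[:i] + (path[:-1],) + node[i + 1:])
    return result

def propagate(counts, mode):
    """Push every count one level up the lattice.

    mode == "overlap": each parent receives the full child count (assumed reading).
    mode == "split":   the child count is divided evenly among its parents
                       (assumed reading).
    """
    if mode not in ("overlap", "split"):
        raise ValueError("mode must be 'overlap' or 'split'")
    out = defaultdict(float)
    for node, count in counts.items():
        ps = parents(node)
        for p in ps:
            out[p] += count if mode == "overlap" else count / len(ps)
    return dict(out)

leaf_counts = {(("a", "b"), ("x", "y")): 6.0}
overlap_level1 = propagate(leaf_counts, "overlap")
split_level1 = propagate(leaf_counts, "split")
```

Under these assumptions the total mass grows under "overlap" because both parents see the same items, while "split" conserves the total. Applying `propagate` a second time reaches the common grandparent along two paths, which is the diamond shape the title refers to.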
doi:10.1145/1007568.1007588 dblp:conf/sigmod/CormodeKMS04 fatcat:3a7d7qxt3jftxdp536if35ezqm