Distributed top-k aggregation queries at large

Thomas Neumann, Matthias Bender, Sebastian Michel, Ralf Schenkel, Peter Triantafillou, Gerhard Weikum
Distributed and Parallel Databases, 2009
Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address
three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments with three different real-life datasets, using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.

Keywords: Top-k · Distributed queries · Query optimization · Cost models

1 Introduction

1.1 Motivation and problem statement

Top-k query processing is a fundamental cornerstone of multimedia similarity search, ranked retrieval of documents from digital libraries and the Web, preference queries over product catalogs, and many other modern applications. Conceptually, top-k queries can be seen as operator trees that evaluate (SQL or XQuery) predicates over one or more tables, perform outer joins to combine multi-table data for the same entities or perform grouping by entities (e.g., by document ids), subsequently aggregate a "goodness" measure such as frequencies or IR-style scores, and finally output the top-k results with regard to this aggregation. Ideally, an efficient query processor would not read the entire input (i.e., all tuples from the underlying tables) but would rather find ways of terminating early once the k best results can be safely determined, using techniques such as priority queues, bounds for partially computed aggregation values, and pruning of intermediate results.
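As a point of reference, a processor without early termination would read all input tuples, group and aggregate them, and only then select the k best, e.g., with a size-k min-heap. The following minimal sketch (our own illustration; names and structure are not from the paper) shows this baseline that the early-termination techniques above aim to beat by avoiding the full scan:

```python
import heapq
from collections import defaultdict

def topk_full_scan(entries, k):
    """Baseline top-k: group by item, sum values, keep the k best.

    Reads *all* (item, value) tuples; a size-k min-heap keeps the
    selection step at O(n log k) after aggregation.
    """
    agg = defaultdict(float)
    for item, value in entries:          # full scan: no early termination
        agg[item] += value
    heap = []                            # min-heap; root = current k-th best
    for item, total in agg.items():
        if len(heap) < k:
            heapq.heappush(heap, (total, item))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, item))
    return sorted(heap, key=lambda t: -t[0])
```

The distributed algorithms in this paper avoid exactly this exhaustive scan by bounding the aggregates of partially seen items.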
These issues have been intensively researched in recent years (e.g., [7, 10, 15, 20, 21, 28, 35, 40]) and are now fairly well understood for a centralized setting with all data residing on the same server. The current state-of-the-art algorithms for distributed top-k querying [4, 9, 33, 44] address the peculiarities of a distributed setting (in particular, communication cost), but fall short of being a perfect solution for truly large-scale distributed settings (e.g., highly decentralized and dynamic peer-to-peer systems), where additional performance issues become critical and require different compromises. This paper develops novel techniques to address the peculiarities of such large-scale systems and shows their practical viability.

Conceptually, the data we consider resides in a (virtual) table that is horizontally partitioned across many nodes in a wide-area network; partitionings are typically along the lines of value ranges, creation dates, or creators. The queries that we want to evaluate on the (virtual) union of all partitions compute the top-k globally most frequent, least frequent, or highest-scoring items across the entire network. Further, we assume a monotonic aggregation function; most popular aggregation functions (maximum, minimum, (weighted) summation) fall into this class. This framework has important real-world applications:

- Network monitoring over distributed logs [19]. Items are IP addresses, URLs, or file names in P2P file sharing, and queries could aggregate occurrence frequencies or transferred bytes.
- Sensor networks with sensors that have local storage and are periodically polled [31]. Possible items are chemicals that contribute to water or air pollution, and the values represent actual measurements of their concentration. Typical aggregations are based on specific time periods (e.g., morning hour vs. evening hour).
- Mining of social communities and their behavior [18].
Typical items are specific user groups. Interesting aggregations consider frequencies of postings to different blogs, "social tags" and ratings assigned to user-created content, or statistical information from query logs and click streams.

1.2 Computational model and assumptions

Following [9, 33, 44], we consider a distributed system with m peers P_j, j = 1, ..., m. It is assumed that every node can communicate with every other node, possibly with different network costs, but without any limitation of functionality; this can, if necessary, be ensured by means of "proxy" nodes. Each peer P_j owns a fragment of an abstract relation, containing items I and their corresponding (local) values v_j(I). Such pairs are accessible at each peer P_j in sorted order by descending value, i.e., in a (physically or virtually) sorted list L_j. These lists can be implemented by materializing local index lists, but other realizations are conceivable, too. Notice that an item can, and usually does, appear in the lists of more than one peer. Often, some popular items (e.g., URLs or IP addresses in a network traffic log) appear in the lists of nearly all peers.

A query q(k), initiated at a peer P_init, aims at finding the k items with the highest aggregated values V(I) = Aggr_{P_j} v_j(I) over all peers P_j. For the sake of concreteness, we use summation for value aggregation throughout the paper, but weighted sums and other monotonic functions are supported analogously. Scanning the local list L_j allows each peer P_j to retrieve and ship a certain number of its locally highest-value items. The receiving peer (e.g., P_init) can then employ a threshold algorithm [20, 21, 35] for value aggregation and for determining whether previously unseen result candidates potentially qualify for the final top-k result, or whether deeper scans or further probings of unknown values are needed to safely eliminate result candidates.
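The stopping rule of such a threshold algorithm can be sketched in a few lines. The following simplified, single-process version (our own illustration, assuming sum aggregation; real implementations batch sorted accesses and avoid probing every list for every new item) scans m descending-sorted lists round-robin, probes each newly seen item in all lists, and stops once the k-th best aggregate is at least the threshold, i.e., the sum of the values at the current scan frontier:

```python
def threshold_topk(lists, k):
    """Fagin-style threshold algorithm over m lists of (item, value),
    each sorted by descending value; aggregation is summation.
    """
    probe = [dict(L) for L in lists]          # models random-access probes
    seen = {}                                 # item -> exact aggregate
    depth = 0
    while True:
        frontier = 0.0                        # threshold from this round's reads
        for L in lists:
            if depth < len(L):                # sorted access at current depth
                item, v = L[depth]
                frontier += v
                if item not in seen:          # probe new item in every list
                    seen[item] = sum(p.get(item, 0.0) for p in probe)
        depth += 1
        best = sorted(seen.values(), reverse=True)[:k]
        # Safe to stop: no unseen item can beat the current k-th aggregate.
        if len(best) == k and best[-1] >= frontier:
            break
        if all(depth >= len(L) for L in lists):
            break                             # all lists exhausted
    return sorted(seen.items(), key=lambda t: -t[1])[:k]
```

The correctness argument rests on monotonicity: any item not yet seen under sorted access has a local value no larger than the frontier value in each list, so its aggregate cannot exceed the threshold.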
All algorithms in this paper proceed in rounds [9, 33, 44]: in each round, requests are sent to certain network nodes either to scan their local lists to a certain depth or to probe for an item's local value. The requestor subsequently collects and aggregates the results and updates its bookkeeping about top-k candidates. The most important resource to optimize is communication bandwidth or, equivalently, the number of item-value entries that are shipped over the network. In addition, as secondary criteria, we also observe message latencies and processing loads incurred at the nodes. This work sets aside node failures during query execution. In the case of temporary node failures or nodes leaving the system, we can adopt the method of [1], which proposes sending partial results directly to the query initiator, or we can apply a reorganization step for the affected portion of the query execution plan.
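To illustrate this round structure, the following is a simplified sketch of a TPUT-style three-phase execution (our own illustration with network shipping simulated by local lookups; it omits TPUT's upper-bound pruning between phases 2 and 3, and none of this paper's optimizations of scan depths or operator trees are shown):

```python
def tput_topk(peer_lists, k):
    """Simplified three-phase TPUT sketch over m peers, sum aggregation.

    Phase 1: each peer ships its local top-k; the initiator derives a
    lower bound tau1 on the true k-th best aggregate.
    Phase 2: each peer ships every entry with local value >= tau1 / m;
    any item whose true aggregate reaches tau1 must appear here.
    Phase 3: probe the exact aggregate of each surviving candidate.
    """
    m = len(peer_lists)
    lookup = [dict(L) for L in peer_lists]    # stands in for remote probes
    partial = {}
    for L in peer_lists:                      # phase 1: local top-k per peer
        for item, v in L[:k]:
            partial[item] = partial.get(item, 0.0) + v
    tau1 = sorted(partial.values(), reverse=True)[:k][-1]
    partial = {}
    for L in peer_lists:                      # phase 2: entries >= tau1 / m
        for item, v in L:
            if v < tau1 / m:
                break                         # lists sorted descending
            partial[item] = partial.get(item, 0.0) + v
    exact = {item: sum(p.get(item, 0.0) for p in lookup)
             for item in partial}             # phase 3: exact aggregates
    return sorted(exact.items(), key=lambda t: -t[1])[:k]
```

Each phase corresponds to one round of requests; the entries shipped in phases 1 and 2 dominate the bandwidth cost, which is exactly where the data-adaptive scan depths of this paper attack.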
doi:10.1007/s10619-009-7041-z