Approximate continuous querying over distributed streams

Graham Cormode, Minos Garofalakis
2008 ACM Transactions on Database Systems  
While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space/time efficient (at each remote monitor site) and communication efficient (across the underlying communication network), while providing continuous, guaranteed-quality approximate query answers.
In this paper, we propose novel algorithmic solutions for the problem of continuously tracking a broad class of complex aggregate queries in such a distributed-streams setting. Our tracking schemes maintain approximate query answers with provable error guarantees, while simultaneously optimizing the storage space and processing time at each remote site, and the communication cost across the network. In a nutshell, our algorithms rely on tracking general-purpose randomized sketch summaries of local streams at remote sites, along with concise prediction models of local site behavior, in order to produce highly communication- and space/time-efficient solutions. The end result is a powerful approximate query tracking framework that readily incorporates several complex analysis queries (including distributed join and multi-join aggregates, and approximate wavelet representations), thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model. Experiments with real data validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.

Traditional database systems are built for one-shot query processing; this has led to a very successful industry of database engines optimized for supporting complex, one-shot SQL queries over large amounts of data. Recent years, however, have witnessed the emergence of a new class of large-scale event monitoring applications that pose novel data-management challenges. In one class of applications, monitoring a large-scale system is a crucial aspect of system operation and maintenance. As an example, consider the Network Operations Center (NOC) for the IP-backbone network of a large ISP (such as Sprint or AT&T). Such NOCs are typically impressive computing facilities, monitoring 100's of routers, 1000's of links and interfaces, and blisteringly-fast sets of events at different layers of the network infrastructure (ranging from fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level transport constructs). The NOC has to continuously track and correlate usage information from a multitude of monitoring points in order to quickly detect and react to hot spots and floods, failures of links or protocols, intrusions, and attacks. A different class of applications is one in which monitoring is the goal in itself. For instance, consider a wireless network of seismic, acoustic, and physiological sensors deployed for habitat, environmental, and health monitoring. The key objective for such systems is to continuously monitor and correlate sensor measurements for trend analysis, detecting moving objects, intrusions, or other adverse events. Similar issues arise in sophisticated satellite-based systems that perform atmospheric monitoring for weather patterns.

A closer examination of such monitoring applications allows us to abstract a number of common characteristics. First, monitoring is continuous: we need real-time tracking of measurements or events, not merely one-shot responses to sporadic queries. Second, monitoring is inherently distributed: the underlying infrastructure comprises several remote sites (each with its own local data source) that can exchange information through a communication network. This also means that there typically are important communication constraints, owing either to network-capacity restrictions (e.g., in IP-network monitoring, where the volumes of collected utilization and traffic data can be huge [Cranor et al. 2003]), or to power and bandwidth restrictions (e.g., in wireless sensor networks, where communication overhead is the key factor in determining sensor battery life [Madden et al. 2003]). Furthermore, each remote site may see a high-speed stream of data and has its own local resource limitations, such as storage-space or processing-time constraints. This is certainly true for IP routers (which cannot possibly store the log of all observed packet traffic at high network speeds), as well as for wireless sensor nodes (which, even though they may not observe large data volumes, typically have very little memory onboard).
Another key aspect of large-scale event monitoring is the need to effectively track queries that combine and/or correlate information (e.g., IP traffic or sensor measurements) observed across the collection of remote sites. For instance, tracking the result size of a join (the "workhorse" correlation operator in the relational world) over the streams of fault/alarm data from two or more IP routers (e.g., with a join condition based on their observed timestamp values) can allow network administrators to effectively detect correlated fault events at the routers and, perhaps, also pinpoint the root causes of specific faults in real time. As another example, consider tracking a two- or three-dimensional histogram summary of the traffic-volume distribution observed across the edge routers of a large ISP network (along axes such as time, source/destination IP address, etc.); clearly, such a histogram could provide a valuable visualization tool for effective circuit provisioning, detection of anomalies and DoS attacks, and so on. Interestingly, when tracking statistical properties of large-scale systems, answers that are precise to the last decimal are typically not needed; instead, approximate query answers (with reasonable guarantees on the approximation error) are often sufficient, since we are typically looking for indicators or patterns rather than precisely-defined events. This works in our favor, allowing us to effectively trade off efficiency against approximation quality.
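To make the flavor of sketch-based join tracking concrete, the following is a minimal, hypothetical Python sketch (not the algorithm developed in this paper, which additionally maintains prediction models and gives precise error and communication guarantees). Each remote site maintains a small AMS ("tug-of-war") sketch of its local substream; because such sketches are linear, a coordinator can merge the per-site sketches and estimate the size of a join between two distributed streams. The names (AMSSketch, four_wise_sign) are illustrative, and the hash below is a simple stand-in for a truly four-wise independent family.

```python
import hashlib

def four_wise_sign(seed: int, item: int) -> int:
    # Deterministic +/-1 "hash" of (estimator index, item); a simple stand-in
    # for the four-wise independent hash families used in AMS sketches.
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1

class AMSSketch:
    """Tug-of-war (AMS) sketch: k counters, each a +/-1-weighted sum of frequencies."""

    def __init__(self, k: int = 256):
        self.k = k
        self.counters = [0] * k

    def update(self, item: int, weight: int = 1) -> None:
        # Fold one stream arrival into every counter.
        for j in range(self.k):
            self.counters[j] += weight * four_wise_sign(j, item)

    def merge(self, other: "AMSSketch") -> None:
        # Sketches are linear: the sketch of the union of two substreams is the
        # component-wise sum of their sketches, which is what lets a coordinator
        # combine per-site summaries.
        for j in range(self.k):
            self.counters[j] += other.counters[j]

    def join_size(self, other: "AMSSketch") -> float:
        # Each product of corresponding counters is an unbiased estimate of
        # sum_i fR(i) * fS(i), the join size; averaging reduces the variance
        # (a full implementation would take medians of such averages).
        return sum(a * b for a, b in zip(self.counters, other.counters)) / self.k

# Hypothetical scenario: two sites observe parts of stream R, a third observes S.
site1_R, site2_R, site_S = AMSSketch(), AMSSketch(), AMSSketch()
for ts in [1, 2, 2, 3]:
    site1_R.update(ts)
for ts in [2, 3, 3, 5]:
    site2_R.update(ts)
for ts in [2, 2, 3, 7]:
    site_S.update(ts)

coordinator_R = AMSSketch()
coordinator_R.merge(site1_R)
coordinator_R.merge(site2_R)
print(coordinator_R.join_size(site_S))  # approximates the true join size, 9
```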
Prior Work. Given the nature of large-scale monitoring applications, their importance for security as well as daily operations, and their general applicability, surprisingly little is known about solutions for many basic distributed-monitoring problems. The bulk of recent work on data-stream processing has focused on developing space-efficient, one-pass algorithms for performing a wide range of centralized, one-shot computations on massive data streams; examples include computing quantiles [Greenwald and Khanna 2001], estimating distinct values [Gibbons 2001] and set-expression cardinalities [Ganguly et al. 2003], counting frequent elements (i.e., "heavy hitters") [Charikar et al. 2002; Cormode and Muthukrishnan 2003; Manku and Motwani 2002], approximating large Haar-wavelet coefficients [Gilbert et al. 2001], and estimating join sizes and stream norms [Alon et al. 1999; Alon et al. 1996; Dobra et al. 2002]. As already mentioned, all the above methods work in a centralized, one-shot setting and, therefore, do not consider communication-efficiency issues. More recent work has proposed methods that carefully optimize site communication costs for approximating different queries in a distributed setting, including quantiles [Greenwald and Khanna 2004] and heavy hitters [Manjhi et al. 2005]; however, the underlying assumption is that the computation is triggered either periodically or in response to a one-shot request. Such techniques are not immediately applicable to continuous monitoring, where the goal is to continuously provide real-time, guaranteed-quality estimates over a distributed collection of streams. It is important to realize that each of the dimensions of our problem (distributed, continuous, and space-constrained) induces specific technical bottlenecks. For instance, even efficient streaming solutions at individual sites can generate constant updates over the network, and so become highly communication-inefficient when used directly for distributed monitoring. Likewise, morphing one-shot solutions into continuous ones either entails propagating every change and recomputing the solution, which is communication-inefficient, or relies on periodic updates and other heuristics that can no longer provide real-time estimation guarantees.

Prior research has looked at monitoring single values, and at building appropriate models and filters to avoid propagating updates that are insignificant relative to a simple aggregate (e.g., the SUM of the distributed values). [Olston et al. 2003] propose a scheme based on "adaptive filters", that is, bounds around the values of distributed variables, which shrink or grow in response to relative stability or variability, while ensuring that the total uncertainty in the bounds is at most a user-specified bound δ. [Jain et al. 2004] propose building a Kalman filter for individual values, and only propagating an update in a value if it falls more than δ away from the predicted value. The BBQ system [Deshpande et al. 2004] builds a dynamic, multi-dimensional probabilistic model of a set of distributed sensor values (viewed as random variables) to drive acquisitional query processing. Given a simple SQL-style query, the system determines whether it is possible to answer the query from the model information alone, or whether it is necessary to poll certain locations for up-to-date information. This was extended to the continuous case in the Ken system [Chu et al. 2006], which ensures that the probabilistic model at the central site stays in agreement (within specified error bounds) with the data actually observed at the remote sites: a remote site ships an update only when its observations deviate from the model's predictions by more than the allowed error.
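For intuition, here is a minimal Python sketch, under the simplifying assumption of a single numeric value per site and a fixed slack δ, of the "push only when the prediction is violated" principle underlying these filter- and model-based schemes; the cited works differ in how the prediction is formed (adaptive bounds, Kalman filters, joint probabilistic models) and in how the error budget is allocated across sites. The class name DeltaFilterSite and its interface are illustrative, not taken from any of these systems.

```python
class DeltaFilterSite:
    """Minimal "report only on significant change" filter for one remote site.

    The coordinator simply assumes the last shipped value until a new one
    arrives, so its view is always within +/- delta of the true local value.
    Adaptive-filter and model-based schemes refine this idea by growing or
    shrinking delta, or by predicting with a richer model (e.g., a Kalman filter).
    """

    def __init__(self, delta: float):
        self.delta = delta
        self.last_shipped = None  # value the coordinator currently believes

    def observe(self, value: float):
        # Process one local observation; return a value to ship, or None to stay silent.
        if self.last_shipped is None or abs(value - self.last_shipped) > self.delta:
            self.last_shipped = value
            return value  # deviation exceeds the error budget: push an update
        return None       # still within the bound: no communication needed

# Hypothetical usage: a sensor reading drifts slowly, then jumps.
site = DeltaFilterSite(delta=0.5)
for reading in [20.0, 20.1, 20.3, 20.4, 23.0, 23.2]:
    update = site.observe(reading)
    if update is not None:
        print(f"ship {update} to coordinator")  # ships 20.0, then 23.0
```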
doi:10.1145/1366102.1366106