Dynamic querying of streaming data with the dQUOB system

B. Plale, K. Schwan
2003 IEEE Transactions on Parallel and Distributed Systems  
Data streaming has established itself as a viable communication abstraction in data-intensive parallel and distributed computations, occurring in applications such as scientific visualization, performance monitoring, and large-scale data transfer. A known problem in large-scale event communication is tailoring the data received at the consumer. It is the general problem of extracting data of interest from a data source, a problem that the database community has successfully addressed with SQL queries, a time-tested, user-friendly way for non-computer scientists to access data. Leveraging the efficiency of query processing provided by relational queries, the dQUOB system provides a conceptual relational data model and SQL query access over distributed data streams. Queries can extract data, combine streams, and create new streams. The language augments queries with an action to enable more complex data transformations, such as Fourier transforms. The dQUOB system has been applied to two large-scale distributed applications: a safety-critical autonomous robotics simulation, and scientific software visualization for global atmospheric transport modeling. In this paper we present the dQUOB system and the results of a performance evaluation undertaken to assess its applicability in data-intensive wide-area computations, where the benefit of portable data transformation must be weighed against the cost of continuous query evaluation.

Background. Data-intensive parallel and distributed computations have seen dramatic increases in scale along multiple dimensions, including numbers and types of data sources, numbers of users, and problem sizes. In the scientific domain, large-scale data-intensive applications exist in computational biology, tomography [28], remote visualization [11], remote instrument control [27], and distributed data analysis [2, 23, 6]. Beyond the scientific domain are rich media collaboration and pervasive computing. This scaling parallels and leverages the recent explosive growth of computers in everyday life. The Internet and high-bandwidth connectivity now reach a significant portion of the population; wireless communication and hand-held devices extend the reach even further. Clusters assembled from commodity PCs make it possible for institutions to provide a major collective distributed computational resource, as demonstrated by the NSF Distributed Terascale Facility (DTF).
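To make the query-over-streams idea concrete, the following is a minimal sketch, not the actual dQUOB query language or API: a continuous selection over a stream of timestamped events, with an optional attached "action" that transforms matching events. All names (`continuous_query`, the event fields, the unit-conversion action) are illustrative assumptions.

```python
# Hypothetical sketch of a dQUOB-style continuous query: a SQL-like
# selection (WHERE-clause predicate) over a stream of timestamped
# events, with an optional action applied to each matching event.
# Names and fields are illustrative, not the actual dQUOB syntax.

def continuous_query(stream, predicate, action=None):
    """Yield events satisfying `predicate`, transformed by `action`."""
    for event in stream:
        if predicate(event):
            yield action(event) if action else event

# Example: extract high-concentration ozone readings from an
# (assumed) atmospheric event stream and attach a converted unit.
events = [
    {"ts": 0, "species": "O3", "ppb": 80},
    {"ts": 1, "species": "NO2", "ppb": 40},
    {"ts": 2, "species": "O3", "ppb": 120},
]

high_ozone = list(continuous_query(
    events,
    predicate=lambda e: e["species"] == "O3" and e["ppb"] > 100,
    action=lambda e: {**e, "ppm": e["ppb"] / 1000.0},
))
# high_ozone == [{"ts": 2, "species": "O3", "ppb": 120, "ppm": 0.12}]
```

In dQUOB proper, the predicate would be written declaratively in SQL and the action could be an arbitrary computation such as a Fourier transform; the generator structure above only sketches the evaluate-per-event flow.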
The impact on data-intensive computation has been dramatic. Scientists now acknowledge that remote location of a data set is no longer a barrier to its inclusion as a data source. The Computational Grid (or Grid for short) [12, 13] addresses the expanding computational base of distributed multi-user workstation clusters and supercomputers with a middleware infrastructure providing services such as communication, security, scheduling, and resource location. Within the class of data-intensive applications is a subclass of applications characterized by the use of data streaming for distributed communication. Rich media and remote visualization are well-known examples, but data streams can exist between any loosely-coupled, autonomous components that communicate asynchronously. In all such applications, streams of events flow from providers to consumers, where an event has no size restriction and contains timestamped data about the behavior or state of a computational entity, physical instrument, or user. Event streaming can be initiated by a data provider (push model) or by a consumer (pull model). Publish-subscribe event communication packages like ECho [8] implement event channels similar to those provided by CORBA [14] but focused on large-scale event flows. The publish-subscribe semantics allow any number of consumers to subscribe to an event channel; the provider need not have knowledge of the location or number of users. Data streams have been treated by others in [10], [4], and [19]. A problem in data streaming applications surfaces when the applications scale in number of data providers and data consumers, and in richness of information exchanged. Specifically, needs mismatches begin to occur. A needs mismatch exists when the data sent by the supplier is not of precisely the amount or form needed by the user. For instance, scientific data generated by an atmospheric transport model may need to undergo a Fourier transform prior to rendering.
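The publish-subscribe semantics described above can be sketched as follows. This is a minimal single-process illustration in the spirit of event channels such as ECho, not the ECho API itself: the publisher pushes into a channel with no knowledge of how many consumers, if any, have subscribed.

```python
# Minimal publish-subscribe event channel sketch (illustrative only;
# real packages such as ECho handle distribution, typing, and
# large-scale event flows, none of which appear here).

class EventChannel:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        """Register a consumer callback; any number may subscribe."""
        self._subscribers.append(handler)

    def publish(self, event):
        """Push an event to every current subscriber."""
        for handler in self._subscribers:
            handler(event)

# Two independent consumers subscribe; the provider publishes once
# without knowing their number or location.
received_a, received_b = [], []
channel = EventChannel()
channel.subscribe(received_a.append)
channel.subscribe(received_b.append)
channel.publish({"ts": 42, "payload": "sample"})
# both consumers receive the same event
```

The decoupling shown here is what lets a query engine like dQUOB be interposed on a channel transparently: neither provider nor consumer code needs to change.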
Similarly, a user may not be interested in all of the 3D grid data generated, but in an aggregation of the data; a simple 2D plot of a one-month trend, for instance.

Approach and Contributions. Our work addresses the needs mismatch problem in large-scale data flows with a novel approach to selectively extracting data from data streams. Earlier work by our group has established the benefits of encapsulating needs-mismatch-style computations into logical tasks that can be associated with data streams to maximize proximity or availability of resources [25, 5, 17, 21]. This paper extends these notions by providing the user with an intuitive relational model for thinking about needs-mismatch computations and a prototype system for creating these computations and embedding them into a data stream. Application of the work first to safety-critical systems [24] and then to scientific computing applications [25] has provided
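The aggregation side of the needs mismatch can be illustrated with a hedged sketch: the provider emits full 3D grids every timestep, but the consumer only wants a one-number-per-timestep trend, so a reduction (here a spatial mean, an assumed choice) is embedded in the stream instead of shipping the grids. Grid shapes and names are illustrative, not drawn from the paper.

```python
# Hedged sketch of a needs-mismatch transformation placed in the
# stream: collapse each (day, 3D grid) event to a (day, scalar)
# trend point, so only the aggregate crosses the wide-area link.
# The [z][y][x] nesting and the mean reduction are assumptions.

def spatial_mean(grid3d):
    """Collapse a nested [z][y][x] grid to a single scalar mean."""
    values = [v for plane in grid3d for row in plane for v in row]
    return sum(values) / len(values)

def trend(stream):
    """Reduce a stream of (day, grid) events to (day, mean) points."""
    return [(day, spatial_mean(grid)) for day, grid in stream]

# Two days of tiny 2x1x2 grids standing in for full model output.
daily = [
    (1, [[[1.0, 3.0]], [[5.0, 7.0]]]),   # values 1,3,5,7 -> mean 4.0
    (2, [[[2.0, 2.0]], [[2.0, 2.0]]]),   # values all 2.0 -> mean 2.0
]
points = trend(daily)
# points == [(1, 4.0), (2, 2.0)], ready for a simple 2D trend plot
```

This is exactly the kind of computation the paper argues should be expressible as a query-plus-action and relocated along the stream, rather than hard-coded at either endpoint.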
doi:10.1109/tpds.2003.1195413