Supporting views in data stream management systems
Thanaa M. Ghanem, Ahmed K. Elmagarmid, Per-Åke Larson, Walid G. Aref
2010
ACM Transactions on Database Systems
In relational database management systems, views supplement basic query constructs to cope with the demand for "higher-level" views of data. Moreover, in traditional query optimization, answering a query using a set of existing materialized views can yield a more efficient query execution plan. Due to their effectiveness, views are attractive to data stream management systems. In order to support views over streams, a data stream management system should employ a query language that allows query composition, that is, the ability to compose complex queries from simpler queries. Prior work on languages to express continuous queries over streams has defined a stream as a sequence of tuples that represents an infinite append-only relation. This paper shows that composition of queries, and hence supporting views, is not possible in the append-only stream model. Then, the paper proposes the Synchronized SQL (or SyncSQL) query language that defines a stream as a sequence of modify operations (i.e., insert, update, and delete) against a relation with a specified schema. Inputs and outputs in any SyncSQL query are interpreted in the same way and, hence, SyncSQL expressions can be composed.
An important issue in continuous queries over data streams is the frequency with which the answer gets refreshed and the conditions that trigger the refresh. Coarser periodic refresh requirements are typically expressed as sliding-window queries. In this paper, the sliding-window approach is generalized by introducing the synchronization principle that empowers SyncSQL with a formal mechanism to express queries with arbitrary refresh conditions. After introducing the semantics and syntax, we lay the algebraic foundation for SyncSQL and propose a query matching algorithm for deciding containment of SyncSQL expressions. Efficient execution of continuous queries is a key requirement in streaming applications. Hence, this paper introduces the Nile-SyncSQL prototype server to support SyncSQL queries, and hence supports views over streams. Nile-SyncSQL employs a pipelined incremental evaluation paradigm in which the query pipeline consists of a set of differential operators. We develop a cost model to estimate the cost of SyncSQL query execution pipelines. The cost model is based on estimating the number of tuples that are processed by the various operators in the pipeline. The cost model is used to choose the best execution plan from a set of different plans for the same query (or the same set of queries). We conduct an experimental study to evaluate the performance of the proposed Nile-SyncSQL prototype server. The experimental results are twofold: (1) showing the effectiveness of the proposed Nile-SyncSQL framework to support continuous queries over data streams; and (2) validating the proposed cost model and showing significant performance gains when views are enabled in data stream management systems.
Query languages in the streaming literature (e.g., [Arasu et al. 2006; Carney et al. 2002; Chandrasekaran et al. 2003; Cranor et al. 2003; StreamSQL; ESL]) define a stream as a sequence of tuples that represents an infinite append-only relation.
Languages based on the append-only model are not closed, that is, the result of a query expression is not necessarily an append-only relation. Not being closed has the negative effect that query expressions cannot be freely composed, i.e., one cannot express a query in terms of one or more sub-queries. Composition is a fundamental property of query languages (e.g., SQL) and it requires that query inputs and outputs be interpreted in the same way. To support continuous query composition, and hence to support views over streams, the following challenges need to be addressed by continuous query languages.
Challenge 1 - Using streams to represent the output of continuous queries that produce non-append-only output: A continuous query may not be able to produce an append-only output relation even when the input streams represent append-only relations. For example, consider an application that monitors a parking lot where two sensors continuously monitor the lot's entrance and exit. The sensors generate two streams of identifiers, say S1 and S2, for vehicles entering and exiting the lot, respectively. A reasonable query in this environment is P1: "Continuously keep track of the identifiers of all vehicles inside the parking lot". The answer to P1 is a view that at any time point, say T, contains the identifiers of vehicles that are inside the parking lot. S1 can be modeled as a stream that inserts tuples into an append-only relation, say ℜ(S1), and similarly, S2 inserts tuples into the append-only relation ℜ(S2). Then, P1 can be regarded as a materialized view that is defined by the set-difference between the two relations ℜ(S1) and ℜ(S2). As tuples arrive into S1 and S2, the corresponding relations are modified, and the relation representing the result of P1 is updated to reflect the changes in the inputs.
The result of P1 is updated by inserting identifiers of vehicles entering the lot and deleting identifiers of vehicles exiting the lot. Notice that although the input relations in P1 change by only inserting tuples (i.e., are append-only), the output of P1 changes by both insertions and deletions. The deletions in P1's output are due to the set-difference operation. P1's output cannot be represented as an append-only stream. In order to represent P1's output as a stream, we should be able to represent two different types of stream tuples (one type of stream tuples to represent the insertions in the output and the other type of stream tuples to represent the deletions). The commonly used sliding-window model is another source of deletions in the outputs of queries over streams [Arasu et al. 2006]. Tuples need to be deleted from the output of a sliding-window query because input tuples expire as the window slides.
Challenge 2 - Similar interpretation of query inputs and output: To enable query composition, query inputs and output should be interpreted in the same way so that the output of one query can be used as input to another query. Similar interpretation of query inputs and output is not always possible in the append-only stream model. For example, the output of query P1 can be produced either as (1) a complete answer, or as (2) an incremental answer. In the case of a complete answer (case 1), at any time point T, the issuer of P1 sees a state, i.e., a relation containing identifiers of all vehicles inside the lot at time T. In the case of an incremental answer (case 2), the issuer of P1 receives a stream that represents the changes (i.e., insertions and deletions) in the state. The output in the incremental case is interpreted in the same way as the inputs, namely, as a stream that represents modifications to an underlying relation.
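The difference between the incremental and the complete answer of P1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SyncSQL syntax or the Nile-SyncSQL implementation: the function names `p1_incremental` and `apply_modifications`, the `'+'`/`'-'` tags, and the simplification that each vehicle identifier appears at most once per stream are all hypothetical.

```python
def p1_incremental(events):
    """Yield ('+', vid) or ('-', vid) modifications to P1's answer.

    `events` is an iterable of (stream, vid) pairs in arrival order, where
    stream is 'S1' (entrance sensor) or 'S2' (exit sensor). Both inputs are
    append-only; the output is not.
    Assumption: each identifier appears at most once per stream.
    """
    r_s1, r_s2 = set(), set()  # the append-only relations behind S1 and S2
    for stream, vid in events:
        before = vid in r_s1 and vid not in r_s2
        (r_s1 if stream == "S1" else r_s2).add(vid)
        after = vid in r_s1 and vid not in r_s2
        if after and not before:
            yield ("+", vid)   # vehicle is now inside the lot
        elif before and not after:
            yield ("-", vid)   # set-difference forces a deletion


def apply_modifications(mods):
    """Rebuild the complete answer (a state) from the incremental answer."""
    state = set()
    for op, vid in mods:
        if op == "+":
            state.add(vid)
        else:
            state.discard(vid)
    return state


events = [("S1", "a"), ("S1", "b"), ("S2", "a"), ("S1", "c")]
mods = list(p1_incremental(events))
state = apply_modifications(mods)
```

Here `mods` is the incremental answer (case 2) and `state` is the complete answer (case 1) after the last event. The exit of vehicle `a` surfaces as a `'-'` tuple, which is exactly the kind of output an append-only stream has no way to carry.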
However, P1's incremental answer cannot be produced or consumed by a query in a language that models a stream as an append-only relation. Existing languages may produce output streams from P1 but the output streams are interpreted differently from the input streams. For example, the output may be modeled as a stream representing a concatenation of serializations of the complete answer (e.g., RStream in CQL [Arasu et al. 2006], and the output of window queries in TelegraphCQ [Chandrasekaran et al. 2003]). Alternatively, CQL divides the output into two append-only streams such that one stream represents the insertions in the output while the second stream represents the deletions (i.e., IStream and DStream).
Consider the following query, P2, from the same application: "Group the vehicles inside the parking lot by type (e.g., trucks, cars, or buses). Continuously keep track of the number of vehicles in each group". By analyzing the two queries, P1 and P2, it is obvious that P2 is an aggregate query over P1's output. This observation ...
(4) Refresh conditions are restricted to be either time- or tuple-based.
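To illustrate why P2 composes with P1 once both input and output are read as modifications to a relation, the following Python sketch consumes a stream of insertions and deletions against P1's answer and emits insert, update, and delete operations against the per-type count relation. It is a hedged illustration rather than the paper's operator: the function name `p2_counts`, the tuple layouts, and the assumption that every P1 modification already carries the vehicle type are hypothetical.

```python
def p2_counts(p1_mods):
    """Yield (operation, vehicle_type, count) modifications to P2's answer.

    `p1_mods` is an iterable of (op, vid, vtype) tuples, where op is '+'
    for a vehicle entering P1's answer and '-' for one leaving it.
    Assumption: the vehicle type travels with each modification.
    """
    counts = {}  # current state of the grouped relation: type -> count
    for op, vid, vtype in p1_mods:
        old = counts.get(vtype, 0)
        if op == "-" and old == 0:
            continue  # ignore a deletion with no matching insertion
        new = old + 1 if op == "+" else old - 1
        if new == 0:
            del counts[vtype]
            yield ("delete", vtype, None)  # group disappears
        elif old == 0:
            counts[vtype] = new
            yield ("insert", vtype, new)   # group appears
        else:
            counts[vtype] = new
            yield ("update", vtype, new)   # group count changes


p1_output = [
    ("+", "a", "car"),
    ("+", "b", "truck"),
    ("+", "c", "car"),
    ("-", "a", "car"),
    ("-", "b", "truck"),
]
p2_output = list(p2_counts(p1_output))
```

Because `p2_output` is again a sequence of modify operations against a relation, it could in turn feed a further query, which is the closure property the append-only model lacks.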
doi:10.1145/1670243.1670244
fatcat:juql6vcsdbgwvhgmk7jfwlv7mm