Fault-Tolerance and High Availability in Data Stream Management Systems [chapter]

Magdalena Balazinska, Jeong-Hyon Hwang, Mehul A. Shah
2009 Encyclopedia of Database Systems  
SYNONYMS
None

DEFINITION
Like any other software system, a data stream management system (DSMS) can experience failures of its components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local-area network (LAN) or a wide-area network (WAN). Failures of processing nodes, or failures in the underlying communication network, can cause continuous queries (CQs) in a DSMS to stall or produce erroneous results. These failures can adversely affect critical client applications that rely on these queries.

Traditionally, availability has been defined as the fraction of time that a system remains operational and properly services requests. In DSMSs, however, availability often also incorporates end-to-end latency, because applications need to react quickly to real-time events and can therefore tolerate only small delays.
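To make the latency-aware notion of availability concrete, here is a minimal Python sketch (not from the chapter; the Delivery record, the latency bound, and the helper names are illustrative assumptions). It contrasts classical uptime-based availability with a variant that counts a result only if it reaches the client within a fixed end-to-end latency bound.

```python
from dataclasses import dataclass

@dataclass
class Delivery:
    produced_at: float   # time the source emitted the tuple (seconds)
    delivered_at: float  # time the client application received the result

def uptime_availability(up_seconds: float, total_seconds: float) -> float:
    """Classical availability: fraction of time the system is operational."""
    return up_seconds / total_seconds

def latency_aware_availability(deliveries: list[Delivery], bound: float) -> float:
    """DSMS-style availability: fraction of results delivered within `bound`
    seconds end-to-end. A late result counts the same as a lost one."""
    on_time = sum(1 for d in deliveries if d.delivered_at - d.produced_at <= bound)
    return on_time / len(deliveries)

if __name__ == "__main__":
    log = [Delivery(0.0, 0.05), Delivery(1.0, 1.02), Delivery(2.0, 2.80)]
    print(uptime_availability(up_seconds=86_340, total_seconds=86_400))  # ~0.9993
    print(latency_aware_availability(log, bound=0.1))                    # ~0.67: the 0.8 s delivery is too late
```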
A DSMS can handle failures using a variety of techniques that offer different levels of availability depending on application needs. All fault-tolerance methods rely on some form of replication, in which volatile state is stored in independent locations to protect against failures. This article describes several such methods for DSMSs, each offering a different trade-off between availability and runtime overhead while maintaining consistency. For cases of network partitions, it outlines techniques that avoid stalling the query at the cost of temporary inconsistency, thereby providing the highest availability. This article focuses on failures within a DSMS and does not discuss failures of the data sources or client applications.

HISTORICAL BACKGROUND
Recently, DSMSs have been developed to support critical applications that must quickly and continuously process data as soon as it becomes available. Example applications include financial stream analysis and network intrusion detection (see KEY APPLICATIONS for more). Fault-tolerance and high availability are important for these applications because faults can lead to quantifiable losses. To support such applications, a DSMS must be equipped with techniques to handle both node and network failures.

All basic techniques for coping with failures involve some form of replication. Typically, a system replicates the state of its computation onto independently failing nodes and must then coordinate the replicas in order to recover properly from failures. Fault-tolerance techniques are usually designed to tolerate up to a pre-defined number, k, of simultaneous failures; a system using such methods is said to be k-fault tolerant.

There are two general approaches to replication and coordination. Both assume that the computation can be modeled as a deterministic state machine [4, 11]. This assumption implies that two non-faulty computations that receive the same input in the same order produce the same output in the same order; hereafter, two computations are called consistent if they generate the same output in the same order. The first approach, known as the state-machine approach, replicates the computation on k + 1 ≥ 2 independent nodes and coordinates the replicas by sending the same input in the same order to all of them [11].
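As a rough illustration of the state-machine approach, the sketch below (hypothetical classes, not taken from the cited work; it assumes a deterministic counting operator as the replicated computation) runs k + 1 replicas, delivers the same input in the same order to all of them, and checks that the replicas stay consistent, so that after up to k simultaneous failures any surviving replica can continue serving the query.

```python
class CountingOperator:
    """A deterministic state machine: given the same input in the same
    order, every instance produces the same output in the same order."""
    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def process(self, key: str) -> tuple[str, int]:
        self.counts[key] = self.counts.get(key, 0) + 1
        return (key, self.counts[key])

def replicate(k: int, input_stream: list[str]) -> list[list[tuple[str, int]]]:
    """State-machine approach: run k + 1 independent replicas and
    coordinate them by delivering the same input in the same order."""
    replicas = [CountingOperator() for _ in range(k + 1)]
    outputs: list[list[tuple[str, int]]] = [[] for _ in replicas]
    for item in input_stream:            # same input, same order, to all replicas
        for replica, out in zip(replicas, outputs):
            out.append(replica.process(item))
    return outputs

if __name__ == "__main__":
    k = 2                                # tolerate up to 2 simultaneous failures
    outs = replicate(k, ["a", "b", "a", "c", "a"])
    # All replicas are consistent: same output in the same order.
    assert all(o == outs[0] for o in outs)
    # After any k failures, at least one replica still holds the full state.
    print(outs[0])  # [('a', 1), ('b', 1), ('a', 2), ('c', 1), ('a', 3)]
```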
doi:10.1007/978-0-387-39940-9_160