Implementing fault-tolerant services using the state machine approach: a tutorial

Fred B. Schneider
1990 ACM Computing Surveys  
Replication is needed for fault tolerance.
- The state machine (SM) approach is a general method for replicating servers and coordinating client interactions with them.
- The SM approach is also a framework for understanding and designing replication management protocols.
- State machine definition: a set of state variables and a set of commands that transform its state.
  - A command is an atomic, deterministic program.
  - Clients make requests to execute commands.
  - A request is a triple (state machine, command, command operands).
  - Output from request processing can go to an actuator, a peripheral device, or waiting clients.
- SM semantics:
  - An SM executes only upon request.
  - Output is completely determined by the sequence of requests.
  - Output is completely independent of time and of other activity in the system.
- Every system can be structured as state machines and clients: anything that can be structured as procedures and procedure calls can be structured using SMs and clients.
- Clients need not wait for output from an SM; they can continue with their next commands after sending a request.
- Failures:
  - Byzantine failures: a component can exhibit arbitrary and malicious behavior.
  - Fail-stop failures: in response to a failure, the component changes to a state that permits other components to detect that a failure has occurred, and then stops.
- A system is t fault tolerant if it satisfies its specification provided that no more than t components become faulty.
- t fault tolerance measures the fault tolerance of the architecture; it does not depend on the reliability of individual components.
- A t fault-tolerant state machine can be implemented by replicating the state machine and running the replicas on separate processors. Henceforth, the set of replicas is called an "ensemble".
  - Under Byzantine failures, a t fault-tolerant ensemble needs at least 2t+1 replicas.
  - Under fail-stop failures, a t fault-tolerant ensemble needs at least t+1 replicas.
- The key issue in SM replication is replica coordination: all replicas receive and process the same sequence of requests.
  - Agreement: every non-faulty SM replica receives every request.
    - It requires two conditions:
      - IC1: All non-faulty processors agree on the same value.
      - IC2: If the transmitter (processor) is non-faulty, then all non-faulty processors use its value as the one on which they agree.
    - Algorithms that satisfy IC1 and IC2 are called "Byzantine agreement protocols", "reliable broadcast protocols", or "agreement protocols".
  - Order: every non-faulty SM replica processes the requests it receives in the same relative order.
    - This requirement is satisfied by assigning unique identifiers to requests and processing requests according to a total order on those identifiers.
    - A request is stable at an SM replica once no request from a correct client bearing a lower unique identifier can subsequently be delivered to it.
    - Order implementation: a replica next processes the stable request with the smallest unique identifier.
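
The replica-count bounds and the Order rule above can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `Request` and `CounterReplica` classes, the `add`/`mul` command set, and the `stable_upto` argument. In particular, the sketch assumes the caller already knows a bound below which no further requests can arrive; how a replica actually establishes stability is outside its scope. It also does not model failures or the Agreement requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    uid: int       # unique identifier; replicas order requests by it
    command: str   # "add" or "mul" (hypothetical command set)
    operand: int


def min_replicas(t: int, failure_model: str) -> int:
    """Smallest ensemble size that is t fault tolerant, per the bounds above."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if failure_model == "byzantine":
        return 2 * t + 1
    if failure_model == "fail-stop":
        return t + 1
    raise ValueError(f"unknown failure model: {failure_model}")


class CounterReplica:
    """One replica of a toy state machine with a single state variable."""

    def __init__(self) -> None:
        self.value = 0
        self.pending = {}  # uid -> Request, buffered until stable

    def deliver(self, request: Request) -> None:
        """Buffer a request; delivery order may differ between replicas."""
        self.pending[request.uid] = request

    def process_stable(self, stable_upto: int) -> None:
        """Apply buffered requests with uid <= stable_upto, smallest uid first.

        Assumption: the caller guarantees that no request with a uid at or
        below stable_upto can still be delivered to this replica.
        """
        for uid in sorted(u for u in self.pending if u <= stable_upto):
            self._apply(self.pending.pop(uid))

    def _apply(self, request: Request) -> None:
        # Commands are deterministic: the result depends only on the
        # current state and the request.
        if request.command == "add":
            self.value += request.operand
        elif request.command == "mul":
            self.value *= request.operand
        else:
            raise ValueError(f"unknown command: {request.command}")


def run_demo():
    """Deliver the same requests to two replicas in different orders."""
    requests = [Request(1, "add", 5), Request(2, "mul", 3), Request(3, "add", 2)]
    a, b = CounterReplica(), CounterReplica()
    for r in requests:
        a.deliver(r)
    for r in reversed(requests):
        b.deliver(r)
    a.process_stable(3)
    b.process_stable(3)
    return a.value, b.value
```

Calling `run_demo()` returns the same value for both replicas even though the requests reached them in opposite orders, because each replica applies stable requests by ascending identifier rather than by arrival. Applying them in arrival order instead would make the two replicas diverge, since `add` and `mul` do not commute.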
doi:10.1145/98163.98167 fatcat:pw6w7mgncnamhmkhidvroz5nau