Failure Detection and Randomization: A Hybrid Approach to Solve Consensus

Marcos Kawazoe Aguilera, Sam Toueg
1998 SIAM Journal on Computing (Print)
We present a consensus algorithm that combines unreliable failure detection and randomization, two well-known techniques for solving consensus in asynchronous systems with crash failures. This hybrid algorithm combines advantages from both approaches: it guarantees deterministic termination if the failure detector is accurate, and probabilistic termination otherwise. In executions with no failures or failure detector mistakes, the most likely ones in practice, consensus is reached in only two asynchronous rounds.

... In particular, [7] presents a consensus algorithm with the following features. Even if the information provided by the failure detectors is completely wrong, the algorithm never violates safety, i.e., no two processes ever decide differently. During "good" periods, when the failure detectors are reasonably accurate, processes reach consensus within a few asynchronous rounds; on the other hand, when a "bad" period occurs, i.e., when failure detectors lose their accuracy, the consensus algorithm may stop making progress until the bad period is over. Such an algorithm is useful because in practice good periods tend to be long while bad ones tend to be rare and short. However, long bad periods do occasionally occur, and each time this happens the consensus algorithm of [7] can be delayed for a long time.

In this paper, we seek an algorithm that terminates quickly when failure detection is accurate (i.e., during good periods) and that makes progress and terminates, albeit more slowly, even if failure detection is inaccurate (i.e., during bad periods). We achieve this goal by combining failure detection with randomization, another technique that was used to solve consensus in asynchronous systems [4]. In this hybrid approach, randomization "kicks in" as a back-up to failure detection when failure detectors are inaccurate. Further discussion of the relative merits of failure detection, randomization, and this hybrid approach is postponed to Section 7.

The idea of combining randomization and failure detection to solve consensus in asynchronous systems first appeared in [12]. A related idea, namely, combining randomization and deterministic algorithms to solve consensus in synchronous systems, was explored in [15, 25]. A brief comparison with our results is given in Section 8.
doi:10.1137/s0097539796312915 fatcat:blognanfrngc5hb3wbsukk5wc4