Revisiting simultaneous consensus with crash failures

Yoram Moses, Michel Raynal
<span title="">2009</span> <i title="Elsevier BV"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/rwjg7tprafhufajayuvxj2q4n4" style="color: black;">Journal of Parallel and Distributed Computing</a> </i> &nbsp;
This paper addresses the "consensus with simultaneous decision" problem in a synchronous system prone to t process crashes. The problem requires that all the processes that do not crash decide on the same value (consensus) and that all decisions are made during the very same round (simultaneity). There is thus a double agreement: one on the decided value (data agreement) and one on the decision round (time agreement). This problem was first defined by Dwork and Moses, who analyzed it and solved it using an analysis of the evolution of states of knowledge in a system with crash failures. The current paper presents a simple algorithm that optimally solves simultaneous consensus. Optimality means here that, in each and every run, the simultaneous decision is taken as soon as any protocol decides, given the same failure pattern and initial value. The design principle of this algorithm is simplicity, a first-class criterion. A new optimality proof is given that is stated in purely combinatorial terms.

Simultaneous decision in a synchronous environment with process crashes

Abstract: This report presents a consensus algorithm for a synchronous system with process crashes, in which the processes that decide do so during the same computation round. The emphasis is on the simplicity of the algorithm's design.

Keywords: synchronous system, distributed algorithm, consensus, process crash, simultaneous decision, round-based computation model.

The consensus problem

Fault-tolerant systems often require a means by which processes or processors can arrive at an exact mutual agreement of some kind [15]. If the processes making up a computation never have to agree, that computation is actually a set of independent computations, and consequently is not an inherently distributed computation. The agreement requirement is captured by the consensus problem, which is one of the most important problems of fault-tolerant distributed computing. It occurs every time entities (usually called agents, processes, nodes, sensors, etc.; "processes" is the word we use in the following) have to agree. The consensus problem is surprisingly simple to state: each process is assumed to propose a value, and all the processes that are not faulty have to decide (termination) on the same value (agreement), which has to be one of the proposed values (validity).
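The three properties in this statement can be checked mechanically on the outcome of a single run. The following is a minimal sketch under stated assumptions: the function name `check_consensus`, the dictionary-based encoding of a run, and the use of `None` for "never decided" are all hypothetical choices made for illustration and do not come from the paper.

```python
from typing import Dict, Hashable, Optional, Set


def check_consensus(
    proposals: Dict[int, Hashable],
    decisions: Dict[int, Optional[Hashable]],
    crashed: Set[int],
) -> Dict[str, bool]:
    """Check termination, agreement and validity on the outcome of one run.

    proposals: value proposed by each process id (assumed never None).
    decisions: value decided by each process id (None if it never decided).
    crashed:   ids of the processes that crashed during the run.
    """
    # Only the processes that did not crash are constrained.
    correct = [p for p in proposals if p not in crashed]
    decided = [decisions.get(p) for p in correct]
    values = [v for v in decided if v is not None]

    # Termination: every process that does not crash decides.
    termination = len(values) == len(correct)
    # Agreement: no two such processes decide different values.
    agreement = len(set(values)) <= 1
    # Validity: every decided value is one of the proposed values.
    proposed = set(proposals.values())
    validity = all(v in proposed for v in values)

    return {"termination": termination, "agreement": agreement, "validity": validity}


if __name__ == "__main__":
    # Process 2 crashes; processes 1 and 3 both decide "a".
    outcome = check_consensus(
        proposals={1: "a", 2: "b", 3: "a"},
        decisions={1: "a", 2: None, 3: "a"},
        crashed={2},
    )
    print(outcome)
```

In the usage example, the crashed process is ignored and the two remaining processes decide the same proposed value, so all three properties hold.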
The failure model considered in this paper is the process crash model. While consensus is impossible to solve in pure asynchronous systems in the presence of even a single process crash [6] ("pure asynchronous systems" means systems in which there is no upper bound on process speed and message transfer delay), it can be solved in synchronous systems (i.e., systems where there are such upper bounds) whatever the number $n$ of processes and the number $t$ of process crashes ($t < n$).

An important measure for a consensus algorithm is the time it takes for the non-faulty processes to decide. As a computation in a synchronous system can be abstracted as a sequence of rounds, the time complexity of a synchronous consensus algorithm is measured as the minimal number of rounds ($R_t$) a process has to execute before deciding, in the worst case scenario. It has been shown (see, e.g., [5, 12]) that $R_t = t + 1$. Moreover, that bound is tight: there exist algorithms (e.g., see [1, 9, 16]) where no process ever executes more than $R_t$ rounds (these algorithms are thus optimal with respect to that bound).

While $t + 1$ rounds are needed in the worst case scenario, most executions have few failures or are even failure-free. So, an important issue is to be able to design early deciding algorithms, i.e., algorithms that direct the processes to decide "as early as possible" in good scenarios. Let $f$, $0 \le f \le t$, be the number of actual process crashes in an execution. It has been shown that the lower bound on the number of rounds is then $R_{t,f} = \min(f + 2, t + 1)$ (e.g., [2, 12, 17]). As before, this bound is tight: algorithms in which no process ever executes more than $R_{t,f}$ rounds exist (e.g., see [2, 7, 16]).

Simultaneous decision

Consensus agreement is a data agreement property, namely the processes have to agree on the same value.
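To make the round bounds quoted above concrete, here is a minimal sketch that computes t + 1 and min(f + 2, t + 1), and simulates a textbook flooding algorithm in which every process that does not crash decides the smallest value it has seen at the end of round t + 1. This is a standard illustration, not the algorithm of this paper nor necessarily one of the cited algorithms. The names `worst_case_bound`, `early_deciding_bound` and `flood_min`, and the encoding of crashes as `(round, receivers)` pairs, are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def worst_case_bound(t: int) -> int:
    """Worst-case round bound t + 1 when up to t processes may crash."""
    return t + 1


def early_deciding_bound(t: int, f: int) -> int:
    """Early-deciding round bound min(f + 2, t + 1) for f actual crashes."""
    if not 0 <= f <= t:
        raise ValueError("expected 0 <= f <= t")
    return min(f + 2, t + 1)


def flood_min(
    proposals: List[int],
    t: int,
    crashes: Dict[int, Tuple[int, Set[int]]],
) -> Dict[int, Tuple[int, int]]:
    """Simulate flooding consensus; return {pid: (decided_value, decision_round)}.

    crashes maps a process id to (round, receivers): the process crashes during
    that round, its last message reaches only `receivers`, and it is silent in
    all later rounds. Processes listed in `crashes` never decide.
    """
    if len(crashes) > t:
        raise ValueError("more crashes than the bound t allows")
    n = len(proposals)
    known: List[Set[int]] = [{v} for v in proposals]  # values seen so far
    last_round = worst_case_bound(t)

    for r in range(1, last_round + 1):
        inbox: List[Set[int]] = [set() for _ in range(n)]
        for sender in range(n):
            crash = crashes.get(sender)
            if crash is not None and crash[0] < r:
                continue  # crashed in an earlier round: sends nothing
            if crash is not None and crash[0] == r:
                receivers = crash[1]  # partial broadcast in the crash round
            else:
                receivers = set(range(n))
            for dest in receivers:
                inbox[dest] |= known[sender]
        # Messages of round r are delivered before round r + 1 starts.
        for p in range(n):
            known[p] |= inbox[p]

    # Every process that did not crash decides at the end of round t + 1.
    return {p: (min(known[p]), last_round) for p in range(n) if p not in crashes}


if __name__ == "__main__":
    # Process 1 crashes in round 1 (reaching only process 0),
    # then process 0 crashes in round 2 (reaching only process 2).
    print(flood_min([3, 1, 4, 2], t=2, crashes={1: (1, {0}), 0: (2, {2})}))
    print([early_deciding_bound(3, f) for f in range(4)])
```

In the first example the value 1 survives only through a chain of crashing processes, yet both remaining processes learn it by the end of round 3 = t + 1 and decide it. In this sketch the decision round is always t + 1, even in a failure-free run, whereas the early-deciding bound for t = 3 ranges over 2, 3, 4, 4 as f goes from 0 to 3.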
Depending on the actual failure pattern, and on the way this pattern is perceived by the processes, it is possible for several processes to decide at distinct rounds. The only guarantee is that the decision round is bounded by $R_t$ (or $R_{t,f}$). This uncertainty on the rounds at which the processes decide can be a serious drawback for real-time oriented applications where agreement is required not only on the decided value, but also on the time at which the decision is taken. More precisely, these applications require that the processes decide on the same value (data agreement), during the very same round (time agreement). This property is also called simultaneous decision.

Among the algorithms that ensure simultaneous decision, there are trivially all the "classical" consensus algorithms where all the processes that do not crash systematically decide at the end of round $R_t = t + 1$. This observation immediately suggests the following question: "As far as the simultaneous decision property is concerned, are there early deciding algorithms, i.e., algorithms whose maximal number of rounds in the worst case scenario can be determined from $f$ (and $t$)?" Unfortunately, it is shown in [2] that the answer to that question is negative: $R_t = t + 1$ rounds is the best that can be done when both the parameters $t$ and $f$ are considered. At first glance, this can appear counter-intuitive, as it states that $t + 1$ is a bound for simultaneous decision whatever the value of $f$ (i.e., even when no process crashes)!

Early simultaneous decision

So, given an execution, a more refined analysis requires considering not the parameters
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1016/j.jpdc.2009.01.001">doi:10.1016/j.jpdc.2009.01.001</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/s34xmst4yrgzdlmgq7e7ugje6e">fatcat:s34xmst4yrgzdlmgq7e7ugje6e</a> </span>