The expressive power of snap-stabilization

Alain Cournier, Ajoy K. Datta, Stéphane Devismes, Franck Petit, Vincent Villain
Theoretical Computer Science, 2016
A snap-stabilizing algorithm, regardless of the initial configuration of the system, guarantees that it always behaves according to its specification. We consider here the locally shared memory model. In this model, we propose the first snap-stabilizing Propagation of Information with Feedback (PIF) algorithm for rooted networks of arbitrary connected topology that is proven assuming the distributed unfair daemon. Then, we use the proposed PIF algorithm as a key module in designing snap-stabilizing solutions for some fundamental problems in distributed systems, such as Leader Election, Reset, Snapshot, and Termination Detection. Finally, we show that in the locally shared memory model, snap-stabilization is as expressive as self-stabilization by designing a universal transformer that provides a snap-stabilizing version of any algorithm that can be self-stabilized with the transformer of Katz and Perry (Distributed Computing, 1993). Since, by definition, a snap-stabilizing algorithm is also self-stabilizing, self- and snap-stabilization have the same expressiveness in the locally shared memory model. This paper is an extended version of preliminary results [1, 2].

1 Context

Modern distributed systems are made of a large number of interconnected processes. Increasing the number of components (processes or links) in a distributed system means increasing the probability that some of these components fail during the execution of a distributed algorithm. Moreover, due to the large scale, quick human intervention to repair failed components is not possible. In this context, fault-tolerance, i.e., the ability of a distributed algorithm to endure faults, is therefore mandatory.

We consider a particular type of faults, called transient faults. A transient fault occurs at an unpredictable time but does not result in permanent hardware damage. Moreover, as opposed to intermittent faults, the frequency of transient faults is considered to be low. Consequently, network components affected by transient faults temporarily deviate from their specifications, e.g., some messages in a link may be lost, reordered, duplicated, or corrupted. As a result, a transient fault affects the state of the component in which it occurs. Hence, after a finite number of transient faults, the configuration of a distributed system can be arbitrary, i.e., process memories can be corrupted and communication links may contain corrupted messages.

In 1974, Dijkstra [3] proposed a general paradigm called self-stabilization to enable the design of distributed systems tolerating any finite number of transient faults. Consider the first configuration after all transient faults cease. This configuration is arbitrary, but no further transient fault will ever occur from it. By abuse of language, this configuration is referred to in the literature as the arbitrary initial configuration of the system. A self-stabilizing algorithm (provided that faults have not corrupted its code) then guarantees that, starting from an arbitrary initial configuration, the system recovers within finite time, without any external intervention, to a configuration from which its specification is (always) satisfied. Thus, self-stabilization makes no hypotheses on the nature or extent of the transient faults that could hit the system, and the system recovers from the effects of those faults in a unified manner. Such versatility comes at a price: after transient faults cease, there is a finite period of time, called the stabilization phase, during which the safety properties of the system may be violated. Hence, self-stabilizing algorithms are mainly compared according to their stabilization time, the maximum duration of the stabilization phase. Several approaches have been introduced to offer more stringent guarantees than simple eventual recovery, e.g., fault-containment [4], superstabilization [5], and snap-stabilization [6, 7]. We focus here on the concept of snap-stabilization, introduced by Datta et al. in 1999 [6, 7].
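To make the locally shared memory model concrete before contrasting self- and snap-stabilization, here is a minimal sketch of Dijkstra's K-state token ring [3], the canonical self-stabilizing algorithm. It is our illustration, not an algorithm from this paper, and all identifiers (N, K, privileged, fire) are ours:

```python
import random

# Dijkstra's K-state token ring [3] in the locally shared memory model:
# each process i holds a variable x[i] in {0, ..., K-1} and reads only its
# left neighbor. Exactly one "privilege" (enabled guard) circulates once
# the system has stabilized, whatever the initial configuration.

N = 5          # ring size
K = N + 1      # K > N guarantees stabilization

def privileged(x, i):
    """Guard of process i: may it move in configuration x?"""
    if i == 0:                     # the root compares with its left neighbor
        return x[0] == x[N - 1]
    return x[i] != x[i - 1]

def fire(x, i):
    """Action of process i (assumes its guard holds)."""
    if i == 0:
        x[0] = (x[0] + 1) % K      # the root advances its value modulo K
    else:
        x[i] = x[i - 1]            # other processes copy their left neighbor

# Start from an arbitrary, possibly corrupted configuration; a random
# scheduler standing in for the daemon fires one enabled process per step.
x = [random.randrange(K) for _ in range(N)]
steps = 0
while sum(privileged(x, i) for i in range(N)) > 1:   # >1 privilege: illegitimate
    enabled = [i for i in range(N) if privileged(x, i)]
    fire(x, random.choice(enabled))
    steps += 1
print(f"stabilized to a single privilege after {steps} steps: x = {x}")
```

Running the sketch from random initial values shows the defining trade-off of self-stabilization: several illegitimate configurations (here, multiple privileges) may be traversed during the stabilization phase before the single circulating privilege is restored.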
Snap-stabilization is a stronger form of self-stabilization: after transient faults cease, a snap-stabilizing system immediately resumes correct behavior, without any external intervention, provided the faults have not corrupted its code. More formally, starting from an arbitrary initial configuration (i.e., the first configuration after the end of the faults), a snap-stabilizing system (always) satisfies its specification. Hence, by definition, snap-stabilizing algorithms are self-stabilizing algorithms whose stabilization time is null.

To illustrate the advantage of snap-stabilization over self-stabilization, we consider the fundamental problem of termination detection. In this problem, any process p can be requested (by the application layer) to detect whether some distributed algorithm X has terminated. More precisely, upon a request, a process should initiate a query to learn whether X has terminated, and when p delivers the answer "yes" (resp. "no"), X "has terminated" (resp. "may not have terminated"). Let A_self (resp. A_snap) be a self-stabilizing (resp. snap-stabilizing) algorithm for detecting the termination of some distributed algorithm X, and let p be any process. If A_self starts from an arbitrary initial configuration and X eventually terminates, then all we know is that eventually only "yes" answers will be computed for all of p's queries, and these answers will truly indicate that X has terminated. However, during the stabilization phase, p may deliver "yes" answers while X actually has not terminated. In other words, A_self can compute false (or unsafe) answers several (but a finite number of) times before finally computing true/correct answers. In contrast, using A_snap, starting from an arbitrary initial configuration, the very first answer delivered by p to any initiated query can be trusted to be a true answer. It is important to note that snap-stabilizing systems are not insensitive to transient faults. For example, using A_snap, if a transient fault occurs between an initiated query and its associated answer, then the answer may not be correct, i.e., a process may deliver "yes" while X actually has not terminated. However, every answer returned to any query initiated after the end of the faults will be correct. In contrast, A_self just guarantees that only a finite, yet generally unbounded, number of false answers will be returned after the faults cease.

Related Work

Since the seminal work of Bui et al. [6, 7], many snap-stabilizing solutions, dedicated to various problems and handling different network topologies, have been proposed. Most notably, numerous papers on snap-stabilization deal with the Propagation of Information with Feedback (PIF) and depth-first token circulation problems. The former has been addressed in rooted tree networks [8, 9] and in arbitrary connected rooted networks [10, 11, 12].

Lemma 4.22 From any configuration, before r executes B-action, each process executes F-action at most once while it is in Tree(r).

Proof. Assume by contradiction that, before r executes B-action, a process p executes F-action at least twice while it is in Tree(r). After the first execution of F-action, p satisfies S_p = F. Then, according to the algorithm, p must successively execute the P-, C-, and B-actions before it executes F-action again. Now, any process q in Tree(r) satisfies S_q = B when p executes C-action (Tree(r) is Dead by Lemma 4.9). Hence, r must initiate another broadcast by B-action so that p can attach to Tree(r) by B-action, a contradiction. □
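The proof above refers to the B-, F-, P-, and C-actions of the PIF algorithm, which this excerpt does not reproduce. As a reading aid only, here is a heavily simplified, non-stabilizing sketch of a single PIF wave on a rooted tree in the guarded-command style of the locally shared memory model. It models the statuses C (clean), B (broadcast), and F (feedback) of S_p, but omits the P-action, the Question Part, and all recovery from corrupted configurations; the guards and topology below are our assumptions, not the paper's:

```python
import random

# One PIF wave on a small rooted tree, started from the clean configuration.
# Statuses: C (clean), B (broadcast), F (feedback), stored in S[p] as in the
# excerpt's S_p. This toy is NOT snap-stabilizing: it exists only to make
# the action names B-action / F-action / C-action concrete.

tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}   # children; process 0 is the root r
parent = {1: 0, 2: 0, 3: 1, 4: 1}
S = {p: "C" for p in tree}                           # everyone starts clean

def enabled(p):
    """Return the name of p's enabled action, if any (guard evaluation)."""
    if S[p] == "C" and (p == 0 or S[parent[p]] == "B"):
        return "B-action"     # join (or, for the root, initiate) the broadcast
    if S[p] == "B" and all(S[q] == "F" for q in tree[p]):
        return "F-action"     # all children have fed back: feed back in turn
    if S[p] == "F" and p != 0 and S[parent[p]] != "B":
        return "C-action"     # the wave has passed: clean up
    return None

trace = []
while S[0] != "F":            # the wave completes when feedback reaches r
    moves = [(p, a) for p in tree if (a := enabled(p)) is not None]
    p, a = random.choice(moves)      # an arbitrary daemon picks one enabled move
    S[p] = a[0]                      # "B-action" -> "B", "F-action" -> "F", ...
    trace.append((p, a))
print(trace)  # e.g. [(0, 'B-action'), (1, 'B-action'), ..., (0, 'F-action')]
```

Read against this sketch, the proof's cycle F → P → C → B → F is the full status cycle a process traverses between two feedbacks in the paper's richer algorithm; the contradiction rests on the fact that re-entering the broadcast state B requires the root to initiate a new wave.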
Using a reasoning similar to the one used for Corollary 4.2, we can deduce the following corollary from Lemma 4.22:

Corollary 4.3 From any configuration, before r executes B-action, each process executes P-action and C-action at most once while it is in Tree(r).

By Lemmas 4.21 and 4.22 and Corollary 4.3, we get the following:

Lemma 4.23 From any configuration, O(N) actions of the PIF Part are executed in Tree(r) before r executes B-action.

Lemma 4.24 From any configuration, Tree(r) generates O(∆ × N²) actions of the Question Part before r executes B-action.

Proof. Reasoning as in the proof of Lemma 4.18, and using Lemmas 4.17 and 4.21, it is easy to see that Tree(r) generates O(∆ × N²) actions of the Question Part before r executes B-action. □

By Lemmas 4.23 and 4.24, we obtain the following result:

Lemma 4.25 From any configuration, Tree(r) generates O(∆ × N²) actions before r executes B-action.

By Lemmas 4.20 and 4.25, we can claim the following:

Theorem 4.5 From any configuration, r executes B-action in O(∆ × N³) steps.
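To give a rough feel for the counting arguments, the toy run below tallies the actions of one PIF wave of the simplified sketch above on a path of N processes (a topology chosen purely for simplicity; the path, the sequential daemon, and the function name are our assumptions). Its linear growth matches the O(N) bound on the PIF Part (Lemma 4.23); the extra ∆ × N² factor in Theorem 4.5 comes from the Question Part and from recovery out of arbitrary configurations, neither of which this toy models:

```python
# Count the actions of one PIF wave on a path 0 - 1 - ... - N-1 rooted at 0,
# modeling only the B- and F-actions of the PIF Part (no cleaning, no
# Question Part, clean initial configuration).

def pif_wave_actions(N):
    S = ["C"] * N                 # S[p] for processes 0 .. N-1; 0 is the root
    actions = 0
    while S[0] != "F":            # one wave: until feedback reaches the root
        for p in range(N):        # a simple sequential daemon
            if S[p] == "C" and (p == 0 or S[p - 1] == "B"):
                S[p] = "B"        # B-action: extend the broadcast down the path
                actions += 1
            elif S[p] == "B" and (p == N - 1 or S[p + 1] == "F"):
                S[p] = "F"        # F-action: feedback climbs back to the root
                actions += 1
    return actions

for N in (4, 8, 16, 32):
    print(N, pif_wave_actions(N))  # 2N actions per wave: linear in N
```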
doi:10.1016/j.tcs.2016.01.036