Fundamentals of fault-tolerant distributed computing in asynchronous environments

Felix C. Gärtner
1999 ACM Computing Surveys  
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
doi:10.1145/311531.311532 fatcat:jetckaxzdfchvkxpn6u67wahcy