Scalable Distributed Consensus to Support MPI Fault Tolerance
2012
2012 IEEE 26th International Parallel and Distributed Processing Symposium
As system sizes increase, the amount of time in which an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. In order to support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus. This is used, for example, to collectively decide on a set of failed processes. This paper describes a
doi:10.1109/ipdps.2012.113
dblp:conf/ipps/Buntinas12
fatcat:bbyqwnz3k5e6pfiwcza5yokble