Scalable Distributed Consensus to Support MPI Fault Tolerance

Darius Buntinas
2012 IEEE 26th International Parallel and Distributed Processing Symposium
As system sizes increase, the time an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. To support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus, which is used, for example, to collectively decide on a set of failed processes. This paper describes a distributed consensus algorithm that is used to support new MPI fault-tolerance features proposed by the MPI 3 Forum's fault-tolerance working group. The algorithm was implemented and evaluated on a 4,096-core Blue Gene/P. The implementation was able to perform a full-scale distributed consensus in 305 µs and scaled logarithmically.
doi:10.1109/ipdps.2012.113 dblp:conf/ipps/Buntinas12 fatcat:bbyqwnz3k5e6pfiwcza5yokble