Assessing HPC Failure Detectors for MPI Jobs
2012
2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Reliability is one of the challenges faced by exascale computing. Components are poised to fail during large-scale executions given current mean time between failure (MTBF) projections. To cope with failures, resilience methods have been proposed as explicit or transparent techniques. For the latter techniques, this paper studies the challenge of fault detection. This work contributes a study on generic fault detection capabilities at the MPI level and beyond. The objective is to assess
doi:10.1109/pdp.2012.11
dblp:conf/pdp/KharbasKHM12
fatcat:mpvstwgbmndt5cbxlo25krwt5m