A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2017. The file type is application/pdf.
Understanding the propagation of transient errors in HPC applications
2015
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15)
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI
doi:10.1145/2807591.2807670
dblp:conf/sc/AshrafGKDCB15
fatcat:7sh3rfrbd5bhvgudejetshh2by