The file type is application/pdf.
A scalable double in-memory checkpoint and restart scheme towards exascale
2012
IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are an effective approach to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In
doi:10.1109/dsnw.2012.6264677
dblp:conf/dsn/ZhengNK12
fatcat:p56cp4bohzh7jli3rrfvtkb4sy