AI-Ckpt
2013
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13
With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, which makes a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional
doi:10.1145/2493123.2462918
fatcat:qxyhp3sverbcrindfgtqzd5tcq