A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2019; you can also visit the original URL.
A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. ... CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. ... This work was partially supported by NSF Grants ACI-1440788 and OAC-1740218, and by Grant 2014-345 from a "Chaire d'attractivité" de l'IDEX, Université Fédérale Toulouse Midi-Pyrénées. ...arXiv:1808.00117v1 fatcat:jozz2xwhczhknepgojuszrkm5q
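The forked checkpointing idea mentioned in this abstract can be illustrated with a minimal host-side sketch. This is not CRUM's implementation: it assumes a POSIX system, keeps all state in ordinary host memory (a plain `fork` does not capture GPU device memory, which is the harder problem the paper addresses), and the names `forked_checkpoint`, `wait_for_checkpoint`, and `load_checkpoint` are hypothetical. The child process holds a copy-on-write snapshot taken at fork time and writes it to storage while the parent keeps computing.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    The child sees a copy-on-write snapshot of the parent's memory as of the
    fork, so the parent may keep mutating `state` while the child writes.
    """
    pid = os.fork()
    if pid == 0:
        # Child: persist the snapshot, then exit without parent cleanup handlers.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())  # push the image to stable storage
            os.replace(tmp, path)  # atomic rename: no partially written image
            status = 0
        finally:
            os._exit(status)
    return pid  # Parent: returns immediately and continues computing.


def wait_for_checkpoint(pid):
    """Block until the checkpoint child exits; return True on success."""
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def load_checkpoint(path):
    """Read a checkpoint image written by forked_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    state = {"step": 3}
    child = forked_checkpoint(state, ckpt)
    state["step"] = 4  # computation proceeds while the child writes
    assert wait_for_checkpoint(child)
    print(load_checkpoint(ckpt))
```

The write to a temporary file followed by `os.replace` is a design choice of this sketch, so that a crash during checkpointing leaves the previous image intact.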
CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. ... CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the ... We also thank Rohan Garg for conversations describing his earlier design of CRUM for CUDA. ...arXiv:2008.10596v1 fatcat:yth3s343gfe4xjlsbilmqtu55q
Last, the CRUM framework presented in  , which also relies on a proxy-based approach along with new shadow page synchronization mechanisms, directly addresses the support for CUDA's unified virtual ... memory (UVM) available in the latest device generations, enabling fast asynchronous checkpointing for large-memory CUDA UVM applications and significantly reducing checkpointing overheads. ...doi:10.1145/3403956 fatcat:77xcpnevmnc5jfpj6ynhwdng3m