
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory [article]

Rohan Garg, Apoorve Mohan, Michael Sullivan, Gene Cooperman
2018 arXiv   pre-print
A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes.  ...  CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage.  ...  This work was partially supported by NSF Grants ACI-1440788 and OAC-1740218, and by Grant 2014-345 from a "Chaire d'attractivité" de l'IDEX, Université Fédérale Toulouse Midi-Pyrénées.  ... 
arXiv:1808.00117v1 fatcat:jozz2xwhczhknepgojuszrkm5q

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM [article]

Twinkle Jain, Gene Cooperman
2020 arXiv   pre-print
CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications.  ...  CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the  ...  We also thank Rohan Garg for conversations describing his earlier design of CRUM for CUDA.  ... 
arXiv:2008.10596v1 fatcat:yth3s343gfe4xjlsbilmqtu55q

Predictive Reliability and Fault Management in Exascale Systems

Ramon Canal, Carles Hernandez, Rafa Tornero, Alessandro Cilardo, Giuseppe Massari, Federico Reghenzani, William Fornaciari, Marina Zapater, David Atienza, Ariel Oleksiak, Wojciech Piątek, Jaume Abella
2020 ACM Computing Surveys  
Last, the CRUM framework presented in [74], which also relies on a proxy-based approach along with new shadow page synchronization mechanisms, directly addresses the support for CUDA's unified virtual  ...  memory (UVM) available in the latest device generations, enabling fast asynchronous checkpointing for large-memory CUDA UVM applications and significantly reducing checkpointing overheads.  ... 
doi:10.1145/3403956 fatcat:77xcpnevmnc5jfpj6ynhwdng3m