A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
The file type is
Deep learning (DL) applications are increasingly being deployed on HPC systems, to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put to facilitate distributed training by DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC Chainer,arXiv:2012.00825v1 fatcat:r3kw6fvx6ffylpwdoejts55g5i