One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain ... Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime ... We summarize our contributions as follows: • We introduce a series of design principles that enable efficient fine-grain asynchronous checkpointing of deep learning models. ...
doi:10.1109/ccgrid49817.2020.00-76 dblp:conf/ccgrid/NicolaeLWBDC20 fatcat:s4565nfzczhfzmk4gir3tgkt64
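The core idea behind asynchronous checkpointing, as summarized in the abstract, is to decouple the brief in-memory capture of model state from the slow write to persistent storage, so training can continue while I/O proceeds in the background. The sketch below illustrates that idea in plain Python; it is a minimal illustration under assumed names (`checkpoint_async`, a dict-based model state), not the authors' implementation or API.

```python
import copy
import os
import pickle
import tempfile
import threading

def checkpoint_async(model_state, path):
    """Snapshot model state in memory, then persist it on a background
    thread. Only the in-memory copy blocks training; the slow file I/O
    overlaps with continued computation."""
    snapshot = copy.deepcopy(model_state)  # brief blocking copy

    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)       # slow I/O runs in background

    t = threading.Thread(target=_write)
    t.start()
    return t  # caller may join() before the next checkpoint to bound pending writes

# Usage: training mutates the live state after the checkpoint starts,
# but the snapshot taken at checkpoint time is unaffected.
state = {"epoch": 3, "weights": [0.1, 0.2, 0.3]}
path = os.path.join(tempfile.gettempdir(), "ckpt_sketch.pkl")
t = checkpoint_async(state, path)
state["weights"][0] = 9.9  # training continues while the write is in flight
t.join()
with open(path, "rb") as f:
    restored = pickle.load(f)
```

A real design along these lines must also bound the memory held by pending snapshots and coordinate writes across distributed workers, which is where the fine-grain design principles the paper introduces come in.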
2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
: Towards Scalable Asynchronous Checkpointing of Deep Learning Models (p. 172). Bogdan Nicolae (Argonne National Laboratory), Jiali Li (University of Tennessee, Knoxville), Justin M. ...
Load Imbalance in Data Processing for Large-Scale Deep Learning (p. 262). Sarunya Pumma (Virginia Tech), Daniele Buono (IBM T.J. ...
doi:10.1109/ccgrid49817.2020.00004 fatcat:czp4goqj2vavrbab2bzevxsmd4