DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Bogdan Nicolae, Jiali Li, Justin M. Wozniak, George Bosilca, Matthieu Dorier, Franck Cappello
2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain ... Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime ... We summarize our contributions as follows: • We introduce a series of design principles that enable efficient fine-grain asynchronous checkpointing of deep learning models. ...
doi:10.1109/ccgrid49817.2020.00-76 dblp:conf/ccgrid/NicolaeLWBDC20 fatcat:s4565nfzczhfzmk4gir3tgkt64

CCGrid 2020 TOC

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
: Towards Scalable Asynchronous Checkpointing of Deep Learning Models 172 Bogdan Nicolae (Argonne National Laboratory), Jiali Li (University of Tennessee, Knoxville), Justin M.  ...  Load Imbalance in Data Processing for Large-Scale Deep Learning 262 Sarunya Pumma (Virginia Tech), Daniele Buono (IBM T.J.  ... 
doi:10.1109/ccgrid49817.2020.00004 fatcat:czp4goqj2vavrbab2bzevxsmd4