Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

Tonmoy Dey, Kento Sato, Bogdan Nicolae, Jian Guo, Jens Domke, Weikuan Yu, Franck Cappello, Kathryn Mohror
2020 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)  
However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multilevel checkpointing efficiently, it is important to optimize checkpoint/restart configurations.  ...  In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations.  ...  MACHINE LEARNING FOR CHECKPOINT/RESTART The count and interval are two of the most important checkpoint parameters for optimizing the CR configuration.  ... 
doi:10.1109/ipdpsw50202.2020.00174 dblp:conf/ipps/DeySNGDYCM20 fatcat:j4ifj4zvffapbnb5rvyrzg7tvy

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Bogdan Nicolae, Jiali Li, Justin M. Wozniak, George Bosilca, Matthieu Dorier, Franck Cappello
2020 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)  
However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization  ...  Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) shows significant improvement over state-of-art, both in terms of checkpointing duration and runtime  ...  RELATED WORK Multi-level checkpoint-restart is a popular approach to leverage multiple storage levels in the context of HPC checkpointing.  ... 
doi:10.1109/ccgrid49817.2020.00-76 dblp:conf/ccgrid/NicolaeLWBDC20 fatcat:s4565nfzczhfzmk4gir3tgkt64

Table of contents

2007 2007 IEEE International Conference on Cluster Computing  
Scott, Chokchai Box Leangsuksun) 488 Identifying Energy-Efficient Concurrency Levels Using Machine Learning (Matthew Curtis-Maury, Karan Singh, Sally A. McKee, Filip Blagojevic, Dimitrios S.  ...  Panda) 452 A Reliability-Aware Approach for an Optimal Checkpoint/Restart Model in HPC Environments (Yudan Liu, Raja Nassar, Chokchai Box Leangsuksun, Nichamon Naksinehaboon, Mihaela Paun, Stephen  ... 
doi:10.1109/clustr.2007.4629204 fatcat:x46l5mpk2bhknm22siovutaqju

Acceleration of MPI mechanisms for sustainable HPC applications

2015 Supercomputing Frontiers and Innovations  
, and their usage from applications.  ...  with other libraries and integrated into dynamic high-level programming paradigms, permit the development of adaptable applications  ...  Section 3 talks about storage and I/O techniques, Section 4 deals with energy constraints, and Section 5 presents some application and algorithm optimizations. The final section concludes the paper.  ...  However, system level checkpoint/restart is unable, in its current state, to cope with very adversarial future failure patterns.  ...
doi:10.14529/jsfi150202 fatcat:hnu3cj5nwzhmjccfwa2drudck4

Exploring versioned distributed arrays for resilience in scientific applications

A Chien, P Balaji, N Dun, A Fang, H Fujita, K Iskra, Z Rubenstein, Z Zheng, J Hammond, I Laguna, D Richards, A Dubey (+5 others)
2016 The international journal of high performance computing applications  
GVR's multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient.  ...  The required changes are small (<2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes.  ...  Multi-stream versioning allows applications to optimize cost of versioning and provided resilience.  ...
doi:10.1177/1094342016664796 fatcat:aaipn5vawrg4dhzka4rigj325y

Speeding up Deep Learning with Transient Servers [article]

Shijian Li and Robert J. Walls and Lijie Xu and Tian Guo
2019 arXiv   pre-print
Our study demonstrates the potential of transient servers with a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations.  ...  Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers.  ...  The reason we use an on-demand instance for the parameter server for distributed training is to avoid the checkpoint restarts that would result if parameter server was revoked.  ... 
arXiv:1903.00045v2 fatcat:kdg2ggymibasvoirjmcl74blmi

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis [article]

Tal Ben-Nun, Torsten Hoefler
2018 arXiv   pre-print
We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search.  ...  Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design.  ...  The simplest form of fault tolerance in machine learning is checkpoint/restart, in which w(t) is periodically synchronized and persisted to a non-volatile data store (e.g., a hard drive).  ...
arXiv:1802.09941v2 fatcat:ne2wiplln5eavjvjwf5to7nwsu

Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks

Qing Liu, Jeremy Logan, Yuan Tian, Hasan Abbasi, Norbert Podhorszki, Jong Youl Choi, Scott Klasky, Roselyne Tchoua, Jay Lofstead, Ron Oldfield, Manish Parashar, Nagiza Samatova (+5 others)
2013 Concurrency and Computation  
Focusing on putting users first with a service oriented architecture, we combined cutting edge research into new I/O techniques with a design effort to create near optimal I/O methods.  ...  As a result, ADIOS provides the highest level of synchronous I/O performance for a number of mission critical applications at various Department of Energy Leadership Computing Facilities.  ...  Their shared experiences with parallel I/O helped  ... 
doi:10.1002/cpe.3125 fatcat:iieybtpgojdedlmlaes26argzu

Author Index

2008 2008 IEEE International Symposium on Parallel and Distributed Processing  
Scalable Group-based Checkpoint/Restart for Large-Scale Message-passing Systems Ho, Roy S.C.  ...  Algorithm for the Maximum Flow Problem Leangsuksun, Chokchai (Box) An Optimal Checkpoint/Restart Model for a Large Scale High Performance Computing System Lee, Chee Wai Towards Scalable Performance Analysis  ... 
doi:10.1109/ipdps.2008.4536576 fatcat:7unikf5ywjhjtdd6xtrmcom3gq

A Study of Checkpointing in Large Scale Training of Deep Neural Networks [article]

Elvis Rojas, Albert Njoroge Kahira, Esteban Meneses, Leonardo Bautista Gomez, Rosa M Badia
2020 arXiv   pre-print
In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads.  ...  We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow).  ...  It gives us an idea of the behavior of the checkpoint-restart and its relationship with the performance of distributed training.  ...
arXiv:2012.00825v1 fatcat:r3kw6fvx6ffylpwdoejts55g5i

AI-Ckpt

Bogdan Nicolae, Franck Cappello
2013 Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13  
Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead on the application  ...  Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed  ...  Checkpoint-Restart (CR) [14] is a popular approach to provide fault-tolerance for scientific applications.  ...
doi:10.1145/2493123.2462918 fatcat:qxyhp3sverbcrindfgtqzd5tcq

AI-Ckpt

Bogdan Nicolae, Franck Cappello
2013 Proceedings of the 22nd international symposium on High-performance parallel and distributed computing  
Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead on the application  ...  Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed  ...  Checkpoint-Restart (CR) [14] is a popular approach to provide fault-tolerance for scientific applications.  ...
doi:10.1145/2462902.2462918 fatcat:47g3ebymejgyvfb62rs5zevbaq

Message Passing Interface (MPI) [chapter]

2005 Advanced Computer Architecture and Parallel Processing  
as other MPI implementations, are based on monolithic software architectures that, regardless of how well-abstracted and logically constructed, are highly complex software packages, presenting a steep learning  ...  BLCR (Berkeley Lab's Checkpoint/Restart) [33] is a kernel implementation of checkpoint/restart for multi-threaded applications on Linux [20].  ...  BLCR provides a simple user-level interface to libraries/applications that need to interact with checkpoint/restart.  ...
doi:10.1002/0471478385.ch9 fatcat:dze6oxxnirftpnqrdqzuczbzcu

The Italian research on HPC key technologies across EuroHPC

Marco Aldinucci, Giovanni Agosta, Antonio Andreini, Claudio A. Ardagna, Andrea Bartolini, Alessandro Cilardo, Biagio Cosenza, Marco Danelutto, Roberto Esposito, William Fornaciari, Roberto Giorgi, Davide Lengani (+5 others)
2021 Proceedings of the 18th ACM International Conference on Computing Frontiers  
checkpoint/restart functions and remote direct accelerator memory access, to be used for example in MPI one-sided communication primitives with heterogeneous workloads.  ...  properties at system and application level; (4) the development of machine learning and artificial intelligence tools that would enhance the effectiveness in designing key industrial components.  ... 
doi:10.1145/3457388.3458508 fatcat:nbnzfa2frvbpflj6tcbsk4bwcq

Towards an Exascale Enabled Sparse Solver Repository [chapter]

Jonas Thies, Martin Galgon, Faisal Shahzad, Andreas Alvermann, Moritz Kreutzer, Andreas Pieper, Melven Röhrig-Zöllner, Achim Basermann, Holger Fehske, Georg Hager, Bruno Lang, Gerhard Wellein
2016 Lecture Notes in Computational Science and Engineering  
Node-level checkpointing using SCR: A more scalable approach has been evaluated using the Scalable Checkpoint-Restart (SCR) library [44], which provides node-level checkpoint/restart mechanisms.  ...  Fault Tolerance Strategy The strategy followed in the ESSR to achieve fault tolerance w.r.t. hardware failures can be classified as an application-level checkpoint/restart (C/R) method.  ...
doi:10.1007/978-3-319-40528-5_13 fatcat:jancdp27w5hktf5y6utn533zwi