Towards Aggregated Asynchronous Checkpointing [article]

Mikaila J. Gossman, Bogdan Nicolae, Jon C. Calhoun, Franck Cappello, Melissa C. Smith
2021 arXiv   pre-print
To this end we implement and study two aggregation strategies, their limitations, and propose a new aggregation strategy specifically for asynchronous multi-level checkpointing.  ...  This paper discusses the viability and challenges of designing aggregation techniques for asynchronous multi-level checkpointing.  ...
arXiv:2112.02289v1 fatcat:z6cdz46x6rf3vhg3ru5xruwtsi

Oolong

Christopher Mitchell, Russell Power, Jinyang Li
2012 Proceedings of the Asia-Pacific Workshop on Systems - APSYS '12  
The event-driven nature of triggers is particularly appropriate for asynchronous computation where workers can independently process part of the state towards convergence without any need for global synchronization  ...  Using Oolong, we have implemented solutions for several large-scale asynchronous computation problems, achieving good performance and robust fault tolerance.  ...  Piccolo is optimized for computation whose intermediate state fit in the aggregate memory of the cluster. Oolong extends Piccolo to provide support for asynchronous computation.  ... 
doi:10.1145/2349896.2349907 dblp:conf/apsys/MitchellPL12 fatcat:shwmcyyyirgu3pk2ikxy3dcdj4

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Bogdan Nicolae, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, Franck Cappello
2019 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
This paper proposes a versatile asynchronous checkpointing solution that addresses this problem.  ...  Index Terms: parallel I/O; checkpoint-restart; immutable data; adaptive multilevel asynchronous I/O  ...  Aggregation of asynchronous I/O using an active backend: With increasing core count per node, asynchronous checkpointing involves a significant coordination overhead needed to manage the producers and  ...
doi:10.1109/ipdps.2019.00099 dblp:conf/ipps/NicolaeMGMC19 fatcat:6anbo4rezvedleejttnowfmnpe

Toward simulation-time data analysis and I/O acceleration on leadership-class systems

Venkatram Vishwanath, Mark Hereld, Michael E. Papka
2011 2011 IEEE Symposium on Large Data Analysis and Visualization  
Thus, effectively exploiting the network topology of BG/P, leveraging the data semantics of the applications, and asynchronous data staging are critical as we scale towards larger core counts.  ...  In each aggregator group, the node where the aggregation is performed is chosen such that the aggregator nodes are distributed across the collective network in a pset.  ... 
doi:10.1109/ldav.2011.6092178 dblp:conf/ldav/VishwanathHP11 fatcat:cp3q54dg35cdpk3oka7kxhqmve

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Bogdan Nicolae, Jiali Li, Justin M. Wozniak, George Bosilca, Matthieu Dorier, Franck Cappello
2020 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)  
This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/  ...  performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing.  ...  Exploiting local storage as a write cache layer to flush the application data to external storage asynchronously has been proposed before in the context of node-level aggregation of I/O from multiple cores  ... 
doi:10.1109/ccgrid49817.2020.00-76 dblp:conf/ccgrid/NicolaeLWBDC20 fatcat:s4565nfzczhfzmk4gir3tgkt64

AI-Ckpt

Bogdan Nicolae, Franck Cappello
2013 Proceedings of the 22nd international symposium on High-performance parallel and distributed computing  
Given the iterative nature of the targeted applications, we launch the assumption that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in  ...  Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed  ...  page-by-page from the end towards the beginning).  ...
doi:10.1145/2462902.2462918 fatcat:47g3ebymejgyvfb62rs5zevbaq

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, Galen Shipman
2010 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis  
In this paper, we address this issue via a novel functional partitioning (FP) runtime environment that allocates cores to specific application tasks: checkpointing, de-duplication, and scientific data  ...  For example, our evaluation shows that dedicating 1 core on an oct-core machine for checkpointing and its assist tasks using FP can improve overall execution time of a FLASH benchmark on 80 and 160 cores  ...  Asynchronously, as discussed in Section IV, the benefactor may drain data from the SSD to the secondary storage system. Once the checkpoint is complete, the benefactors inform the manager.  ...
doi:10.1109/sc.2010.28 dblp:conf/sc/LiVBMMKES10 fatcat:dporqhnegvavnlu4hmdm3h6zzi

Detailed analysis of I/O traces for large scale applications

N. Nakka, A. Choudhary, W. K. Liao, L. Ward, R. Klundt, M. I. Weston
2009 2009 International Conference on High Performance Computing (HiPC)  
In particular, these I/O traces provide multiple indications towards the algorithmic nature of the application by observing the changes of data amount and I/O request distribution at the checkpoints.  ...  The key observations that we made in the trace were (1) Variation in aggregate data sizes across checkpoints for AMR and non-AMR applications, (2) Variation in the number of I/O calls by a client depending  ...  From a visual observation it is clear that the aggregate data transferred for each checkpoint is the same for every consecutive checkpoint for both versions of alegra.  ... 
doi:10.1109/hipc.2009.5433186 dblp:conf/hipc/NakkaCLWKW09 fatcat:ihzii6aexvar3ioxjpqffldf6e

AI-Ckpt

Bogdan Nicolae, Franck Cappello
2013 Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13  
Given the iterative nature of the targeted applications, we launch the assumption that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in  ...  Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed  ...  page-by-page from the end towards the beginning).  ...
doi:10.1145/2493123.2462918 fatcat:qxyhp3sverbcrindfgtqzd5tcq

Phantasy: Low-Latency Virtualization-based Fault Tolerance via Asynchronous Prefetching

Shiru Ren, Yunqi Zhang, Lichen Pan, Zhen Xiao
2018 IEEE transactions on computers  
low-latency fault tolerance in the virtualized environment, we first identify two bottlenecks in prior approaches, namely the overhead for tracking dirty pages in software and the long sequential dependency in checkpointing  ...  To address these bottlenecks, we design a novel mechanism to asynchronously prefetch the dirty pages without disrupting the primary VM execution to shorten the sequential dependency.  ...  In aggregate, Phantasy is able to reduce the dirty pages that need to be transmitted in the checkpoints by 55.16%.  ... 
doi:10.1109/tc.2018.2865943 fatcat:ymp766plznca5eti32da43tfgi

Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms

Michal Pitonak, Marian Gall, Adrian Rodriguez-Bazaga, Valeria Bartsch
2019 Zenodo  
This library automates the process of creating checkpoints and hides the overhead due to checkpoint distribution to mirrors via asynchronous GASPI communication.  ...  The 'CP commit' corresponds to total time spent in committing of checkpoints (created asynchronously) after each 50 iterations.  ... 
doi:10.5281/zenodo.2807937 fatcat:5szkqofx3bcqpaypsnrqnjtbue

ISOBAR hybrid compression-I/O interleaving for large-scale parallel I/O optimization

Eric R. Schendel, Scott Klasky, Robert Ross, Nagiza F. Samatova, Saurabh V. Pendse, John Jenkins, David A. Boyuka, Zhenhuan Gong, Sriram Lakshminarasimhan, Qing Liu, Hemanth Kolla, Jackie Chen
2012 Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing - HPDC '12  
At the reported peak bandwidth of 60 GB/s of uncompressed data for a current, leadership-class parallel I/O system, this translates into an effective gain of 7 to 28 GB/s in aggregate throughput.  ...  This case is important as future staging architectures shift toward dedicating compute nodes strictly to simulation work [21], relying on asynchronous RDMA to offload data to I/O nodes and prevent simulation  ...  access to checkpoint files.  ...
doi:10.1145/2287076.2287086 dblp:conf/hpdc/SchendelPJBBGLLKCKRS12 fatcat:ejt432xn4jasfkrfwis4gbktfy

Lightweight Fault Tolerance in Large-Scale Distributed Graph Processing [article]

Da Yan, James Cheng, Fan Yang
2016 arXiv   pre-print
back to the latest checkpoint.  ...  Moreover, the high checkpointing cost prevents frequent checkpointing, and thus recovery has to replay all the computations from a state checkpointed some time ago.  ...  Chandy-Lamport snapshot [13] can be used for checkpointing asynchronous vertex-centric computation like that of GraphLab.  ... 
arXiv:1601.06496v1 fatcat:yq2gtxcvebgdnk2nube2dkvwtm

Asynchronous snapshots of actor systems for latency-sensitive applications

Dominik Aumayr, Stefan Marr, Elisa Gonzalez Boix, Hanspeter Mössenböck
2019 Proceedings of the 16th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes - MPLR 2019  
In order to minimize the impact of snapshotting on request latency, our approach persists the application's state asynchronously by capturing partial heaps, completing snapshots step by step.  ...  To the best of our knowledge, this is the first system that enables asynchronous snapshotting of actor applications, i.e. without stop-the-world synchronization, and thereby minimizes the impact on latency  ...  Asynchronous Local Checkpointing for Actors To the best of our knowledge, the only checkpointing approach for actors is for SALSA [19] .  ... 
doi:10.1145/3357390.3361019 dblp:conf/pppj/AumayrMBM19 fatcat:dk2u5kkerva6tnkpzav7wv44ee

Fault Tolerance for Stream Processing Engines [article]

Muhammad Anis Uddin Nasir
2020 arXiv   pre-print
The checkpointing occurs in an asynchronous manner to avoid the overhead.  ...  ChronoStream [30] delivers fault tolerance for deterministic operators using asynchronous delta checkpointing.  ...
arXiv:1605.00928v3 fatcat:kvdgebicrfbktogtew77mv7ppy