Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

M. Schulz, G. Bronevetsky, R. Fernandes, D. Marques, K. Pingali, P. Stodghill
Proceedings of the ACM/IEEE SC2004 Conference  
DIFFICULTIES IN APPLICATION-LEVEL CHECKPOINTING OF MPI PROGRAMS In this section, we describe the difficulties with implementing application-level, coordinated, non-blocking checkpointing for MPI programs  ...  In this paper, we describe our implementation of application-level, non-blocking checkpointing.  ...  We would like to thank the staff at both centers for cheerfully putting up with our urgent and repeated requests for time on these machines.  ... 
doi:10.1109/sc.2004.29 dblp:conf/sc/SchulzBFMPS04 fatcat:k3pjn65qvrha7f4n6o4qxz2jxm

A scalable double in-memory checkpoint and restart scheme towards exascale

Gengbin Zheng, Xiang Ni, Laxmikant V. Kale
2012 IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)  
With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint.  ...  We extend the in-memory checkpointing scheme to work on MPI communication layer, and demonstrate the performance on very large scale supercomputers.  ...  A runtime-level checkpointing scheme reduces the burden on the programmer, especially by automating the protocol for triggering checkpoints, and carrying out a recovery.  ... 
doi:10.1109/dsnw.2012.6264677 dblp:conf/dsn/ZhengNK12 fatcat:p56cp4bohzh7jli3rrfvtkb4sy

Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

Zizhong Chen, Jack Dongarra
2009 IEEE Transactions on Computers  
The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers.  ...  We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example.  ...  If an application needs to tolerate these types of failures, a two-level recovery scheme [26] , which uses both diskless checkpointing and stable-storage-based checkpointing, is a good choice.  ... 
doi:10.1109/tc.2009.42 fatcat:5et7fpfxvrah3jyngwe4zhoj2m

Unified fault-tolerance framework for hybrid task-parallel message-passing applications

Omer Subasi, Tatiana Martsinkevich, Ferad Zyulkyarov, Osman Unsal, Jesus Labarta, Franck Cappello
2016 The International Journal of High Performance Computing Applications  
We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme.  ...  Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage.  ...  of a task-level scheme The overhead of a task-level scheme is constituted by the checkpoint overhead of all tasks, the rework and restart overheads of the failed tasks, and the message logging overheads  ... 
doi:10.1177/1094342016669416 fatcat:6nfmmfblyfbbhaxuaezhqzuw6u

Acceleration of MPI mechanisms for sustainable HPC applications

2015 Supercomputing Frontiers and Innovations  
management, and their usage for applications.  ...  Some of those solutions could be integrated and provided by MPI, but others should be devised as higher level concepts, less general, but adapted to applicative domains, possibly as programming patterns  ...  In this sense, AHPIOS (Ad-Hoc Parallel I/O system for MPI applications) [14] proposes a scalable parallel I/O system completely implemented in MPI.  ... 
doi:10.14529/jsfi150202 fatcat:hnu3cj5nwzhmjccfwa2drudck4

A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations [article]

Nils Kohl, Johannes Hötzer, Florian Schornbaum, Martin Bauer, Christian Godenschwager, Harald Köstler, Britta Nestler, Ulrich Rüde
2018 arXiv   pre-print
In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain.  ...  To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI.  ...  (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC) and SuperMUC at Leibniz Supercomputing Centre (www.lrz.de  ... 
arXiv:1708.08286v2 fatcat:bkir7z4p5nfcjkkjbz3onfw6fe

Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI

Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra
2013 Concurrency and Computation  
This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset.  ...  paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support of software-level fault tolerance approaches.  ...  ACKNOWLEDGEMENTS This work has been supported in part by grants of the National Science Foundation and sponsored in part by a gift from the University Industry Research Corporation for the research proposal  ... 
doi:10.1002/cpe.3100 fatcat:hrpi6g5kvvhljfcqn7wl7s3yha

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance [chapter]

Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna
2020 Lecture Notes in Computer Science  
In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment.  ...  We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive  ...  The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S.  ... 
doi:10.1007/978-3-030-50743-5_27 fatcat:qqoqkh6iqnhm5pspu6nwlfluou

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance [article]

Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna
2021 arXiv   pre-print
In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment.  ...  We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive  ...  Acknowledgments The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S.  ... 
arXiv:2102.06896v1 fatcat:hfeyoayqurfk3ew3377n5fakh4

A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI [chapter]

Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra
2012 Lecture Notes in Computer Science  
paradigm for parallel applications (the MPI standard) offers extremely limited support of software-level fault tolerance approaches.  ...  In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI standard, to enable algorithmic based recovery, without incurring  ...  the MPI standard, for applications capable of taking advantage of forward recovery.  ... 
doi:10.1007/978-3-642-32820-6_48 fatcat:qub7c7dtfbgn5mrfakhxjipgpy

Aspect-oriented development of cluster computing software

Hyuck Han, Hyungsoo Jung, Heon Y. Yeom
2011 Cluster Computing  
Aspect-Oriented Programming (AOP) is a powerful method for modularizing source code and for decoupling cross-cutting concerns.  ...  In complex software systems, modularity and readability tend to be degraded owing to inseparable interactions between concerns that are distinct features in a program.  ...  The ICT at Seoul National University provided research facilities for this study.  ... 
doi:10.1007/s10586-011-0166-7 fatcat:t5dnql2xjbgzta6d4lwkfmlbym

Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

Gengbin Zheng, Chao Huang, Laxmikant V. Kalé
2006 ACM SIGOPS Operating Systems Review  
These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven  ...  The schemes also allow the program to be restarted on a different number of processors.  ...  Acknowledgements This work was supported in part by DOE (Grant B341494 and B505214), the National Science Foundation (NGS 0103645 and ITR 0205611).  ... 
doi:10.1145/1131322.1131340 fatcat:ryv4mhqvqjejhjuzmffvuzj2dy

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov
2002 ACM/IEEE SC 2002 Conference (SC'02)  
We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications.  ...  To evaluate its capabilities, we run MPICH-V within a framework for which the number of nodes, Channels Memories and Checkpoint Servers can be completely configured as well as the node Volatility.  ...  Brigitte Rozoy for their help in the design of MPICH-V general protocol.  ... 
doi:10.1109/sc.2002.10048 dblp:conf/sc/BosilcaBCDFGHLLMNS02 fatcat:qjda5ip2znfc7c6disuqowd2mi

How to Mitigate Node Failures in Hybrid Parallel Applications [chapter]

Maciej Szpindler
2016 Lecture Notes in Computer Science  
- Other approaches include MPI-3 shared memory model - No fault tolerance is supported - must be provided on application level (as for MPI in general)  ...  OpenMP) - MPI model provides full support for threads • In search for scalability these two models are coupled (hybrid parallelism) - Notable example: MPI+OpenMP - inter-node and intra-node connectivity respectively  ...  "Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver."  ... 
doi:10.1007/978-3-319-32152-3_4 fatcat:qo6dp3dcfvfofionjakrcbzbky

CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

2018 IEEE Transactions on Parallel and Distributed Systems  
As means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing.  ...  Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort.  ...  George Bosilca and Dr. Klaus Iglberger for valuable suggestions and input which helped us overcome design and implementation challenges.  ... 
doi:10.1109/tpds.2018.2866794 fatcat:exthchqwnnf5npli7jchz4jm7u