Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs
Proceedings of the ACM/IEEE SC2004 Conference
DIFFICULTIES IN APPLICATION-LEVEL CHECKPOINTING OF MPI PROGRAMS In this section, we describe the difficulties with implementing application-level, coordinated, non-blocking checkpointing for MPI programs ...
In this paper, we describe our implementation of application-level, non-blocking checkpointing. ...
We would like to thank the staff at both centers for cheerfully putting up with our urgent and repeated requests for time on these machines. ...
doi:10.1109/sc.2004.29
dblp:conf/sc/SchulzBFMPS04
fatcat:k3pjn65qvrha7f4n6o4qxz2jxm
A scalable double in-memory checkpoint and restart scheme towards exascale
2012
IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. ...
We extend the in-memory checkpointing scheme to work on MPI communication layer, and demonstrate the performance on very large scale supercomputers. ...
A runtime-level checkpointing scheme reduces the burden on the programmer, especially by automating the protocol for triggering checkpoints, and carrying out a recovery. ...
doi:10.1109/dsnw.2012.6264677
dblp:conf/dsn/ZhengNK12
fatcat:p56cp4bohzh7jli3rrfvtkb4sy
Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
2009
IEEE transactions on computers
The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. ...
We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example. ...
If an application needs to tolerate these types of failures, a two-level recovery scheme [26], which uses both diskless checkpointing and stable-storage-based checkpointing, is a good choice. ...
doi:10.1109/tc.2009.42
fatcat:5et7fpfxvrah3jyngwe4zhoj2m
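The checksum idea behind diskless checkpointing, mentioned in the entry above, can be illustrated with a minimal sketch. It assumes the simplest possible case: a plain, unweighted sum checksum that tolerates the loss of one process. The weighted, Reed-Solomon-like scheme described in the abstract generalizes this to multiple failures and is not reproduced here. The function names `encode_checksum` and `recover_lost` are hypothetical and are not taken from the paper or from FT-MPI.

```python
# Simplified illustration only: unweighted sum checksum, single failure.
# All names are hypothetical; this is not the paper's weighted scheme.

def encode_checksum(local_states):
    """Element-wise sum of every rank's checkpoint vector.

    In a diskless scheme this vector would be held in the memory
    of a spare (checksum) process rather than written to disk.
    """
    return [sum(column) for column in zip(*local_states)]


def recover_lost(surviving_states, checksum):
    """Rebuild the single lost vector as checksum minus the survivors."""
    survivor_sum = [sum(column) for column in zip(*surviving_states)]
    return [c - s for c, s in zip(checksum, survivor_sum)]


# In-memory checkpoints of three ranks (values chosen to be exact in binary).
states = [[1.0, 2.0], [3.5, 4.0], [0.25, 8.0]]
checksum = encode_checksum(states)

# Suppose rank 1 fails: only ranks 0 and 2 still hold their data.
survivors = [states[0], states[2]]
rebuilt = recover_lost(survivors, checksum)
```

With general floating-point data the subtraction recovers the lost values only up to rounding error, which is one reason such schemes need care when applied to floating-point numbers.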
Unified fault-tolerance framework for hybrid task-parallel message-passing applications
2016
The international journal of high performance computing applications
We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. ...
Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. ...
of a task-level scheme The overhead of a task-level scheme is constituted by the checkpoint overhead of all tasks, the rework and restart overheads of the failed tasks, and the message logging overheads ...
doi:10.1177/1094342016669416
fatcat:6nfmmfblyfbbhaxuaezhqzuw6u
Acceleration of MPI mechanisms for sustainable HPC applications
2015
Supercomputing Frontiers and Innovations
management, and their usage for applications. ...
Some of those solutions could be integrated and provided by MPI, but others should be devised as higher level concepts, less general, but adapted to applicative domains, possibly as programming patterns ...
In this sense, AHPIOS (Ad-Hoc Parallel I/O system for MPI applications) [14] proposes a scalable parallel I/O system completely implemented in MPI. ...
doi:10.14529/jsfi150202
fatcat:hnu3cj5nwzhmjccfwa2drudck4
A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations
[article]
2018
arXiv
pre-print
In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. ...
To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. ...
(www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC) and SuperMUC at Leibniz Supercomputing Centre (www.lrz.de ...
arXiv:1708.08286v2
fatcat:bkir7z4p5nfcjkkjbz3onfw6fe
Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI
2013
Concurrency and Computation
This checkpoint is reloaded in a new MPI application, which restores a sane environment for the forward, application-based recovery technique to repair the failure-damaged dataset. ...
paradigm for parallel applications (the message passing interface (MPI) Standard) offers extremely limited support of software-level fault tolerance approaches. ...
ACKNOWLEDGEMENTS This work has been supported in part by grants of the National Science Foundation and sponsored in part by a gift from the University Industry Research Corporation for the research proposal ...
doi:10.1002/cpe.3100
fatcat:hrpi6g5kvvhljfcqn7wl7s3yha
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
[chapter]
2020
Lecture Notes in Computer Science
In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. ...
We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive ...
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. ...
doi:10.1007/978-3-030-50743-5_27
fatcat:qqoqkh6iqnhm5pspu6nwlfluou
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance
[article]
2021
arXiv
pre-print
In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. ...
We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive ...
Acknowledgments The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. ...
arXiv:2102.06896v1
fatcat:hfeyoayqurfk3ew3377n5fakh4
A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI
[chapter]
2012
Lecture Notes in Computer Science
paradigm for parallel applications (the MPI standard) offers extremely limited support of software-level fault tolerance approaches. ...
In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI standard, to enable algorithmic based recovery, without incurring ...
the MPI standard, for applications capable of taking advantage of forward recovery. ...
doi:10.1007/978-3-642-32820-6_48
fatcat:qub7c7dtfbgn5mrfakhxjipgpy
Aspect-oriented development of cluster computing software
2011
Cluster Computing
Aspect-Oriented Programming (AOP) is a powerful method for modularizing source code and for decoupling cross-cutting concerns. ...
In complex software systems, modularity and readability tend to be degraded owing to inseparable interactions between concerns that are distinct features in a program. ...
The ICT at Seoul National University provided research facilities for this study. ...
doi:10.1007/s10586-011-0166-7
fatcat:t5dnql2xjbgzta6d4lwkfmlbym
Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++
2006
ACM SIGOPS Operating Systems Review
These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven ...
The schemes also allow the program to be restarted on a different number of processors. ...
Acknowledgements This work was supported in part by DOE (Grant B341494 and B505214), the National Science Foundation (NGS 0103645 and ITR 0205611). ...
doi:10.1145/1131322.1131340
fatcat:ryv4mhqvqjejhjuzmffvuzj2dy
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
2002
ACM/IEEE SC 2002 Conference (SC'02)
We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. ...
To evaluate its capabilities, we run MPICH-V within a framework for which the number of nodes, Channels Memories and Checkpoint Servers can be completely configured as well as the node Volatility. ...
Brigitte Rozoy for their help in the design of MPICH-V general protocol. ...
doi:10.1109/sc.2002.10048
dblp:conf/sc/BosilcaBCDFGHLLMNS02
fatcat:qjda5ip2znfc7c6disuqowd2mi
How to Mitigate Node Failures in Hybrid Parallel Applications
[chapter]
2016
Lecture Notes in Computer Science
-Other approaches include MPI-3 shared memory model -No fault tolerance is supported -must be provided on application level (as for MPI in general) ...
OpenMP) -MPI model provides full support for threads • In search for scalability these two models are coupled (hybrid parallelism) -Notable example: MPI+OpenMP -inter-node and intra-node connectivity respectively ...
"Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver." ...
doi:10.1007/978-3-319-32152-3_4
fatcat:qo6dp3dcfvfofionjakrcbzbky
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
2018
IEEE Transactions on Parallel and Distributed Systems
As means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. ...
Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. ...
George Bosilca and Dr. Klaus Iglberger for valuable suggestions and input which helped us overcome design and implementation challenges. ...
doi:10.1109/tpds.2018.2866794
fatcat:exthchqwnnf5npli7jchz4jm7u