MATCH: An MPI Fault Tolerance Benchmark Suite
[article]
2021
arXiv
pre-print
ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. ...
Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. ...
National Science Foundation (CNS-1617967, CCF-1553645 and CCF-1718194). ...
arXiv:2102.06894v1
fatcat:22m5pxsn2zbynedzrr3r6o4ej4
Supporting the Development of Resilient Message Passing Applications Using Simulation
2014
2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). ...
The presented solution permits investigating performance under failure and failure handling of ABFT solutions. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT. ...
With system-level checkpoint/restart, all process data is saved, while in application-level checkpoint/restart the application itself decides which data to save and to restore. ...
doi:10.1109/pdp.2014.74
dblp:conf/pdp/NaughtonEVB14
fatcat:ffugzhjgpfbwzoes4s6u6oc3di
Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications
[article]
2021
arXiv
pre-print
In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. ...
By hiding ULFM behind the MPI calls, the library is able to expose resiliency features to the application in a transparent manner, thus removing any integration effort. ...
An alternative to application-level C/R frameworks are those working at system-level. ...
arXiv:2104.14246v1
fatcat:hm6r5cnz2zhlleeehvyvewnpl4
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance
[chapter]
2020
Lecture Notes in Computer Science
We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive ...
In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. ...
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. ...
doi:10.1007/978-3-030-50743-5_27
fatcat:qqoqkh6iqnhm5pspu6nwlfluou
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance
[article]
2021
arXiv
pre-print
We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive ...
However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage. ...
Acknowledgments The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. ...
arXiv:2102.06896v1
fatcat:hfeyoayqurfk3ew3377n5fakh4
Legio: fault resiliency for embarrassingly parallel MPI applications
2021
Journal of Supercomputing
In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. ...
Natively, MPI cannot handle faults and it stops the execution prematurely when it finds one. ...
These frameworks couple ULFM with a method to restore the execution (typically C/R based) and create an all-in-one tool improving the reliability of an MPI application. ...
doi:10.1007/s11227-021-03951-w
fatcat:mccthxlkvvda7fhepnbkojtpy4
A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations
[article]
2018
arXiv
pre-print
To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. ...
In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. ...
(www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC) and SuperMUC at Leibniz Supercomputing Centre (www.lrz.de ...
arXiv:1708.08286v2
fatcat:bkir7z4p5nfcjkkjbz3onfw6fe
Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery
[article]
2018
arXiv
pre-print
We explore two alternative recovery strategies, which use ULFM along with application-driven in-memory checkpointing. ...
In this paper, we explore the use of fault tolerance extensions to Message Passing Interface (MPI) called user-level failure mitigation (ULFM) for handling process failures without the need to discard ...
For example, the Local Failure Local Recovery (LFLR) [4] approach uses an early implementation of ULFM MPI to facilitate recovery using spare processes. ...
arXiv:1801.04523v1
fatcat:b2r3hjm5qjggnj73dzm4zhn4ji
Post-failure recovery of MPI communication capability
2013
The international journal of high performance computing applications
Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect ...
As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications ...
Lastly, we took the viewpoint of MPI users, and depicted how the ULFM specification can be used to support high-level recovery strategies. ...
doi:10.1177/1094342013488238
fatcat:w3tz4nficbavxccaeiq2zel3sy
A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers
2018
Journal of Scientific Computing
Our approach combines the proposed user-level failure mitigation extensions to the Message-Passing Interface (MPI), with the concepts of message-logging and remote in-memory checkpointing, to demonstrate ...
Many existing parallel simulation codes are not tolerant of these failures and existing resilience methodologies would necessitate major modifications or redesign of the application. ...
We are grateful to the other partners in the ExaFLOW project for their support and helpful suggestions. This work used EPCC's Cirrus HPC Service (https://www.epcc.ed.ac.uk/cirrus). ...
doi:10.1007/s10915-018-0778-7
fatcat:gzhye5opqjf5tgw2k2emirzhwi
Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
2013
2013 42nd International Conference on Parallel Processing
handling using application-level checkpoint/restart. ...
These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique. ...
The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide ...
doi:10.1109/icpp.2013.114
dblp:conf/icpp/EngelmannN13
fatcat:onbxzxytazaktejzc7e77gvwge
Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing
2018
Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering - ICPE '18
Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures. ...
Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniques for transient errors that cause silent data corruptions and techniques ...
An implementation of this pattern using the ULFM extensions to MPI would use MPI_COMM_SHRINK primitive to isolate a failed process from the MPI communicator used by the application. ...
doi:10.1145/3184407.3184421
dblp:conf/wosp/AshrafHE18
fatcat:rsw5tz6zrre3xf2a37ejyzp7pu
Running resilient MPI applications on a Dynamic Group of Recommended Processes
2018
Journal of the Brazilian Computer Society
High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. ...
Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. ...
Acknowledgments We would like to thank the funding agencies and universities involved for the support provided. We also thank the many contributions from the reviewers. ...
doi:10.1186/s13173-018-0069-z
fatcat:d7g75lo5zja6tbfvtwfuuxoyae
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
2021
The international journal of high performance computing applications
This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. ...
A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation ...
Acknowledgements We thank the authors of Agullo et al. (2016a, 2016b), namely E Agullo, L Giraud, A Guermouche, J Roman, P Salas, and M Zounon, for the permission to report the ...
doi:10.1177/1094342021990433
fatcat:tfhovb6xmfemtkgzzrkpiiiju4
Running Resilient MPI Applications on a Dynamic Group of Recommended Processes
2016
2016 Seventh Latin-American Symposium on Dependable Computing (LADC)
High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. ...
Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. ...
Acknowledgments We would like to thank the funding agencies and universities involved for the support provided. We also thank the many contributions from the reviewers. ...
doi:10.1109/ladc.2016.14
dblp:conf/ladc/CamargoD16
fatcat:yhjwom5aqbabxb4su2sefaqrs4