MATCH: An MPI Fault Tolerance Benchmark Suite [article]

Luanzheng Guo, Giorgis Georgakoudis, Konstantinos Parasyris, Ignacio Laguna, Dong Li
2021 arXiv   pre-print
ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design.  ...  Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications.  ...  National Science Foundation (CNS-1617967, CCF-1553645 and CCF-1718194).  ... 
arXiv:2102.06894v1 fatcat:22m5pxsn2zbynedzrr3r6o4ej4

Supporting the Development of Resilient Message Passing Applications Using Simulation

Thomas Naughton, Christian Engelmann, Geoffroy Vallee, Swen Bohm
2014 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing  
The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT).  ...  The presented solution permits investigating performance under failure and failure handling of ABFT solutions. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT.  ...  With system-level checkpoint/restart, all process data is saved, while in application-level checkpoint/restart the application itself decides which data to save and to restore.  ... 
doi:10.1109/pdp.2014.74 dblp:conf/pdp/NaughtonEVB14 fatcat:ffugzhjgpfbwzoes4s6u6oc3di

Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications [article]

Roberto Rocco, Davide Gadioli, Gianluca Palermo
2021 arXiv   pre-print
In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application.  ...  By hiding ULFM behind the MPI calls, the library is capable to expose resiliency features to the application in a transparent manner thus removing any integration effort.  ...  An alternative to application-level C/R frameworks are those working at system-level.  ... 
arXiv:2104.14246v1 fatcat:hm6r5cnz2zhlleeehvyvewnpl4

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance [chapter]

Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna
2020 Lecture Notes in Computer Science  
We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive  ...  In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint.  ...  The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S.  ... 
doi:10.1007/978-3-030-50743-5_27 fatcat:qqoqkh6iqnhm5pspu6nwlfluou

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance [article]

Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna
2021 arXiv   pre-print
We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive  ...  However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage.  ...  Acknowledgments The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S.  ... 
arXiv:2102.06896v1 fatcat:hfeyoayqurfk3ew3377n5fakh4

Legio: fault resiliency for embarrassingly parallel MPI applications

Roberto Rocco, Davide Gadioli, Gianluca Palermo
2021 Journal of Supercomputing  
In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications.  ...  Natively, MPI cannot handle faults and it stops the execution prematurely when it finds one.  ...  These frameworks couple ULFM with a method to restore the execution (typically C/R based) and create an all-in-one tool improving the reliability of an MPI application.  ... 
doi:10.1007/s11227-021-03951-w fatcat:mccthxlkvvda7fhepnbkojtpy4

A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations [article]

Nils Kohl, Johannes Hötzer, Florian Schornbaum, Martin Bauer, Christian Godenschwager, Harald Köstler, Britta Nestler, Ulrich Rüde
2018 arXiv   pre-print
To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI.  ...  In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain.  ...  (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC) and SuperMUC at Leibniz Supercomputing Centre (www.lrz.de  ... 
arXiv:1708.08286v2 fatcat:bkir7z4p5nfcjkkjbz3onfw6fe

Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery [article]

Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann
2018 arXiv   pre-print
We explore two alternative recovery strategies, which use ULFM along with application-driven in-memory checkpointing.  ...  In this paper, we explore the use of fault tolerance extensions to Message Passing Interface (MPI) called user-level failure mitigation (ULFM) for handling process failures without the need to discard  ...  For example, the Local Failure Local Recovery (LFLR) [4] approach uses an early implementation of ULFM MPI to facilitate recovery using spare processes.  ... 
arXiv:1801.04523v1 fatcat:b2r3hjm5qjggnj73dzm4zhn4ji

Post-failure recovery of MPI communication capability

Wesley Bland, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra
2013 The international journal of high performance computing applications  
Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect  ...  As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI communications  ...  Lastly, we took the viewpoint of MPI users, and depicted how the ULFM specification can be used to support high-level recovery strategies.  ... 
doi:10.1177/1094342013488238 fatcat:w3tz4nficbavxccaeiq2zel3sy

A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers

Chris D. Cantwell, Allan S. Nielsen
2018 Journal of Scientific Computing  
Our approach combines the proposed user-level failure mitigation extensions to the Message-Passing Interface (MPI), with the concepts of message-logging and remote in-memory checkpointing, to demonstrate  ...  Many existing parallel simulation codes are not tolerant of these failures and existing resilience methodologies would necessitate major modifications or redesign of the application.  ...  We are grateful to the other partners in the ExaFLOW project for their support and helpful suggestions. This work used EPCC's Cirrus HPC Service (https://www.epcc.ed.ac.uk/cirrus).  ... 
doi:10.1007/s10915-018-0778-7 fatcat:gzhye5opqjf5tgw2k2emirzhwi

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems

Christian Engelmann, Thomas Naughton
2013 2013 42nd International Conference on Parallel Processing  
handling using application-level checkpoint/restart.  ...  These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.  ...  The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide  ... 
doi:10.1109/icpp.2013.114 dblp:conf/icpp/EngelmannN13 fatcat:onbxzxytazaktejzc7e77gvwge

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann
2018 Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering - ICPE '18  
Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures.  ...  Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniques for transient errors that cause silent data corruptions and techniques  ...  An implementation of this pattern using the ULFM extensions to MPI would use MPI_COMM_SHRINK primitive to isolate a failed process from the MPI communicator used by the application.  ... 
doi:10.1145/3184407.3184421 dblp:conf/wosp/AshrafHE18 fatcat:rsw5tz6zrre3xf2a37ejyzp7pu

Running resilient MPI applications on a Dynamic Group of Recommended Processes

Edson Tavares de Camargo, Elias P. Duarte
2018 Journal of the Brazilian Computer Society  
High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults.  ...  Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application.  ...  Acknowledgments We would like to thank the funding agencies and universities involved for the support provided. We also thank the many contributions from the reviewers.  ... 
doi:10.1186/s13173-018-0069-z fatcat:d7g75lo5zja6tbfvtwfuuxoyae

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Tommaso Benacchio, Luca Bonaventura, Mirco Altenbernd, Chris D Cantwell, Peter D Düben, Mike Gillard, Luc Giraud, Dominik Göddeke, Erwan Raffin, Keita Teranishi, Nils Wedi
2021 The international journal of high performance computing applications  
This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems.  ...  A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation  ...  Acknowledgements We thank the authors of Agullo et al. (2016a, 2016b), namely E Agullo, L Giraud, A Guermouche, J Roman, P Salas, and M Zounon, for the permission to report the  ... 
doi:10.1177/1094342021990433 fatcat:tfhovb6xmfemtkgzzrkpiiiju4

Running Resilient MPI Applications on a Dynamic Group of Recommended Processes

Edson Tavares De Camargo, Elias P. Duarte
2016 2016 Seventh Latin-American Symposium on Dependable Computing (LADC)  
High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults.  ...  Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application.  ...  Acknowledgments We would like to thank the funding agencies and universities involved for the support provided. We also thank the many contributions from the reviewers.  ... 
doi:10.1109/ladc.2016.14 dblp:conf/ladc/CamargoD16 fatcat:yhjwom5aqbabxb4su2sefaqrs4