
Implementing Efficient Message Logging Protocols as MPI Application Extensions [article]

Kiril Dichev, Dimitrios S. Nikolopoulos
2019 arXiv   pre-print
Message logging protocols are enablers of local rollback, a more efficient alternative to global rollback, for fault tolerant MPI applications.  ...  Until now, message logging MPI implementations have incurred the overheads of a redesign and redeployment of an MPI library, as well as continued performance penalties across various kernels.  ...  In this work, we challenge this landscape and implement a message logging protocol as an efficient and configurable extension of HPC application kernels.  ... 
arXiv:1905.03184v1 fatcat:rqsr3nwuevdpjorgkuusevfaj4

Implementing efficient message logging protocols as MPI application extensions

Kiril Dichev, Dimitrios S. Nikolopoulos
2019 Proceedings of the 26th European MPI Users' Group Meeting (EuroMPI '19)
In this work, we challenge this landscape and implement a message logging protocol as an efficient and configurable extension of HPC application kernels.  ...  We believe this landscape clearly motivates our work to implement message logging capabilities as application extensions on top of popular and widely-used MPI implementations.  ... 
doi:10.1145/3343211.3343219 dblp:conf/pvm/DichevN19 fatcat:rosirxdlkbcgzmnvcsdi6hwwcy

Performance Evaluation of Open MPI on Cray XE/XK Systems

Samuel K. Gutierrez, Nathan T. Hjelm, Manjunath Gorentla Venkata, Richard L. Graham
2012 2012 IEEE 20th Annual Symposium on High-Performance Interconnects  
LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI, which is on par with the vendor-supplied MPI's achieved parallel efficiency.  ...  In this paper, we present extensions to natively support these architectures within Open MPI; describe and propose solutions for performance and scalability bottlenecks; and provide an extensive evaluation  ...  Bandwidth on Cielo as reported by NetPIPE (log-log plot). Fig. 7. Calculated LAMMPS parallel efficiency for 100 iterations of the weak-scaling Lennard-Jones Liquid problem (higher is better).  ... 
doi:10.1109/hoti.2012.11 dblp:conf/hoti/GutierrezHVG12 fatcat:whfzr6msunadxe6dyxftvnl72e

A Case for Non-blocking Collective Operations [chapter]

Torsten Hoefler, Jeffrey M. Squyres, Wolfgang Rehm, Andrew Lumsdaine
2006 Lecture Notes in Computer Science  
Our claim is that actual CPU overhead for non-blocking collective operations depends on the message size and the communicator size and benefits especially highly scalable applications with huge communicators  ...  90% idle CPU time can be freed for the application.  ...  However, also Ethernet has been optimized for lower host overhead with simplified protocols [27] as well as direct user level access and protocol offloading [28].  ... 
doi:10.1007/11942634_17 fatcat:mrzxflsnvvcfljgiaqzmzjkbjq

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, F. Cappello
2006 The international journal of high performance computing applications  
The MPICH-V project focuses on designing, implementing and comparing several automatic fault tolerance protocols for MPI applications.  ...  MPI is one of the most used message passing libraries in HPC applications. These two trends raise the need for fault tolerant MPI.  ...  Another current trend is the use of MPI as the message passing environment for high performance parallel applications.  ... 
doi:10.1177/1094342006067469 fatcat:6fmggl6mz5djhjpzgvvj3ezc5i

Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure recovery

Aurelien Bouteiller, Thomas Ropars, George Bosilca, Christine Morin, Jack Dongarra
2009 2009 IEEE International Conference on Cluster Computing and Workshops  
In this paper we compare, experimentally, a pessimistic and an optimistic message logging protocol, using this new model and implemented in the Open MPI library.  ...  Among the various approaches for providing fault tolerance to MPI applications, message logging has been proved to tolerate higher failure rate.  ...  IMPLEMENTATION DETAILS This section details the implementation of the two protocols in Open MPI.  ... 
doi:10.1109/clustr.2009.5289157 dblp:conf/cluster/BouteillerRBMD09 fatcat:ilb5qdq3ljgzxb3b66gfzsjg5e

C3: A System for Automating Application-Level Checkpointing of MPI Programs [chapter]

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
2004 Lecture Notes in Computer Science  
We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library.  ...  This thin layer is used by the C3 (Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version.  ...  We have shown how the state of the underlying MPI library can be reconstructed by the implementation of our protocol.  ... 
doi:10.1007/978-3-540-24644-2_23 fatcat:7cyt6lo2bjdhrej5aq4modjxgq

SPBC

Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello
2013 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13)
Our protocol called SPBC combines in a hierarchical way coordinated checkpointing and message logging.  ...  Most existing checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale.  ...  well as other funding bodies (see https://www.grid5000.fr).  ... 
doi:10.1145/2503210.2503271 dblp:conf/sc/RoparsMGSC13 fatcat:5jf4jcuguffyjca7gchmefcsaq

A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI [chapter]

Joshua Hursey, Thomas Naughton, Geoffroy Vallee, Richard L. Graham
2011 Lecture Notes in Computer Science  
This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages.  ...  The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics.  ...  Instead of providing a fault tolerant agreement protocol to the application (i.e., MPI_Comm_validate_all), the FT-MPI project provides it as a transparent component of the runtime environment which is  ... 
doi:10.1007/978-3-642-24449-0_29 fatcat:j5c757rnrfhatdizzedbrl5tpu

Automated application-level checkpointing of MPI programs

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
2003 SIGPLAN notices  
In this paper, we present a suitable protocol, and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state.  ...  High-level description of protocol Phase #1 To initiate a distributed snapshot, the initiator sends a control message called pleaseCheckpoint to all application processes.  ...  The goal of our project is to provide a highly efficient checkpointing mechanism for MPI applications.  ... 
doi:10.1145/966049.781513 fatcat:snsormha3ffc7mbjmgszyz4fgy

Automated application-level checkpointing of MPI programs

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
2003 Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '03)
In this paper, we present a suitable protocol, and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state.  ...  High-level description of protocol Phase #1 To initiate a distributed snapshot, the initiator sends a control message called pleaseCheckpoint to all application processes.  ...  The goal of our project is to provide a highly efficient checkpointing mechanism for MPI applications.  ... 
doi:10.1145/781512.781513 fatcat:7p2td6aqljayhnjxctzyirxzu4

Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand

Qi Gao, Wei Huang, Matthew J. Koop, Dhabaleswar K. Panda
2007 Proceedings of the International Conference on Parallel Processing  
Severe contention for bandwidth to storage system can occur as a large number of processes take a checkpoint at the same time, resulting in an extremely long checkpointing delay for large parallel applications  ...  We implement our design and carry out a detailed evaluation with micro-benchmarks, HPL, and the parallel version of a data mining toolkit, MotifMiner.  ...  Although our current implementation is based on InfiniBand and MVAPICH2, the design can be readily applicable to other coordination checkpointing protocols for other MPI implementations and networks.  ... 
doi:10.1109/icpp.2007.44 dblp:conf/icpp/GaoHKP07 fatcat:6rwfsxf5q5hxbcnrkn556leake

Fault-tolerant solutions for a MPI compute intensive application

J.C. Mourino, M.J. Martin, P. Gonzalez, R. Doallo
2007 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07)  
A segment-level solution has been implemented by means of the extension of a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code.  ...  Checkpointing and rollback recovery is a very useful technique to implement fault-tolerant applications.  ...  There are three classes of message log protocols according to their policy on how to store message logs: optimistic, pessimistic and causal message [2].  ... 
doi:10.1109/pdp.2007.44 dblp:conf/pdp/MourinoMGD07 fatcat:ah5fb4c45rednl33jpzksyzwsu

Path-Synchronous Performance Monitoring in HPC Interconnection Networks with Source-Code Attribution [chapter]

Adarsh Yoga, Milind Chabbi
2017 Lecture Notes in Computer Science  
We have incorporated an effective protocol extension in the Gen-Z communication protocol for tagging network packets in an interconnection network; additionally, we have backed the protocol extension with  ...  The message gets enqueued as a command to the NIC. The NIC notices the command at some later point, which introduces an arbitrary delay.  ...  We propose the following extensions: Protocol extension: every packet of the protocol carries a special Performance Monitoring (PM) tag.  ... 
doi:10.1007/978-3-319-72971-8_11 fatcat:mcqpugtjovepdd5kjx2b5ioo2i

Evaluating the viability of process replication reliability for exascale systems

Kurt Ferreira, Jon Stearley, James H. Laros, Ron Oldfield, Kevin Pedretti, Ron Brightwell, Rolf Riesen, Patrick G. Bridges, Dorian Arnold
2011 Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11)
As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability.  ...  different failure distributions, hardware mean time to failures, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications  ...  To ensure consistent replica state, rMPI implements protocols that ensure identical message ordering between replicas.  ... 
doi:10.1145/2063384.2063443 dblp:conf/sc/FerreiraSLOPBRBA11 fatcat:kyoo75c27nbqfec77op6cfkf3e