Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters

Oren Laadan, Dan Phung, Jason Nieh
2005 Proceedings IEEE International Conference on Cluster Computing  
We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters.  ...  This decoupling enables ZapC to checkpoint an entire distributed application across all nodes in a coordinated manner such that it can be restarted from the checkpoint on a different set of cluster nodes  ...  We have created ZapC, a novel system that extends our previous work on Zap [24] to provide transparent coordinated checkpoint-restart of distributed network applications on commodity clusters.  ... 
doi:10.1109/clustr.2005.347039 dblp:conf/cluster/LaadanPN05 fatcat:hn4rirxqinaqxb725y3t6rdnoa

Checkpointing as a Service in Heterogeneous Cloud Environments

Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin
2015 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing  
The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.  ...  A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability.  ...  This was chosen for its transparent support of distributed applications, including both TCP/IP and the InfiniBand network [2].  ... 
doi:10.1109/ccgrid.2015.160 dblp:conf/ccgrid/CaoSCM15 fatcat:mcqijnshwvcoplzcxvh4uh55aa

Checkpointing as a Service in Heterogeneous Cloud Environments [article]

Jiajun Cao, Matthieu Simonin, Gene Cooperman, Christine Morin
2015 arXiv   pre-print
The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.  ...  A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability.  ...  In a cloud environment, a distributed application is executed using a set of interconnected VMs called a virtual cluster.  ... 
arXiv:1411.1958v2 fatcat:ctg66jjodzex7ho2yeypdwqxka

Use of checkpoint-restart for complex HEP software on traditional architectures and Intel MIC

Kapil Arya, Gene Cooperman, Andrea Dotti, Peter Elmer
2014 Journal of Physics, Conference Series  
We report on tests of checkpoint-restart technology using CMS software, Geant4-MT (multi-threaded Geant4), and the DMTCP (Distributed Multithreaded Checkpointing) package.  ...  Process checkpoint-restart is a technology with great potential for use in HEP workflows.  ...  In this paper we examine the use of a transparent, user-level checkpointing package for distributed applications called Distributed MultiThreaded CheckPointing (DMTCP) [7].  ... 
doi:10.1088/1742-6596/523/1/012015 fatcat:nnsf3uodzraqhenweodmfmq6e4

The DEEP-ER Project: I/O and Resiliency Extensions for the Cluster-Booster Architecture

Anke Kreuzer, Jorge Amaya, Norbert Eicker, Raphael Leger, Estela Suarez
2018 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)  
applications recovering from the larger hardware failure rates expected on these machines.  ...  Real-world scientific codes have tested the projects' developments and demonstrated the improvements achieved without compromising the portability of the applications.  ...  Schmidt (University of Heidelberg) for the NAM results, A. Galonska (JSC) for buddy-checkpointing benchmarks, and S. Rodríguez (BSC) OmpSs resiliency tests with FWI.  ... 
doi:10.1109/hpcc/smartcity/dss.2018.00046 dblp:conf/hpcc/KreuzerAELS18 fatcat:douhh3naufcr3cmri5t3gg3m5u

Fast and transparent recovery for continuous availability of cluster-based servers

Rosalia Christodoulopoulou, Kaloian Manassiev, Angelos Bilas, Cristiana Amza
2006 Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '06  
In this paper we focus on providing support for shared-memory applications running on clusters of commodity nodes and interconnects.  ...  We present the design and implementation of FineFRC (Fine-grained Failure Reconfiguration on Clusters), a runtime system for achieving continuous operation of shared memory applications on commodity clusters  ...  We thankfully acknowledge the support of Natural Sciences and Engineering Research Council of Canada, IBM, Canada Foundation for Innovation, Ontario Centers of Excellence, the European Commission FP6 HiPEAC  ... 
doi:10.1145/1122971.1123005 dblp:conf/ppopp/ChristodoulopoulouMBA06 fatcat:xkgy5g5r6nbprngkivmrxxnwey

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine, Jason Duell, Paul Hargrove, Eric Roman
2005 The International Journal of High Performance Computing Applications  
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability.  ...  Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance.  ...  Department of Energy under Contract No. DE-AC03-76SF00098. Brian Barrett was supported by a Department of Energy High Performance Computer Science fellowship.  ... 
doi:10.1177/1094342005056139 fatcat:eactszrrhncorlr4lu2f3hagla

Live Migration of Processes Maintaining Multiple Network Connections

Balazs Gerofi, Hajime Fujita, Yutaka Ishikawa
2010 IPSJ Online Transactions  
Single IP Address cluster offers a transparent view of a cluster of machines as if they were a single computer on the network.  ...  in less than 200 ms process freeze time, rendering the transition fully transparent and responsive from the clients' point of view.  ...  Acknowledgments This work has been supported by the CREST project of the Japan Science and Technology Agency (JST).  ... 
doi:10.2197/ipsjtrans.3.13 fatcat:s4tjr46bf5dqborcougkdh3iby

BlobCR

Bogdan Nicolae, Franck Cappello
2011 Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11  
Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context.  ...  Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case  ...  The experiments presented in this paper were carried out using the Grid'5000/ALADDIN-G5K experimental testbed, an initiative of the French Ministry of Research through the ACI GRID incentive action, INRIA  ... 
doi:10.1145/2063384.2063429 dblp:conf/sc/NicolaeC11 fatcat:2cq4rtapwbapvgufqddfplkyte

Modeling and Designing Fault-Tolerance Mechanisms for MPI-Based MapReduce Data Computing Framework

Jian Lin, Fan Liang, Xiaoyi Lu, Li Zha, Zhiwei Xu
2015 2015 IEEE First International Conference on Big Data Computing Service and Applications  
memory dump required by checkpoint will bring in huge time and space overhead. (2) Checkpoint/restart demands to handle the reconstruction of inter-process communication and synchronization, which significantly  ...  on the integrity of its corresponding process group.  ...  High-performance computing systems usually run on reliable high-end hardware, while large-scale data computing systems often consider failures of commodity cluster and uneven quality of data as normal  ... 
doi:10.1109/bigdataservice.2015.33 dblp:conf/bigdataservice/LinLLZX15 fatcat:5aredotxsreorhxwergub7jham

Docker Container Deployment in Distributed Fog Infrastructures with Checkpoint/Restart

Arif Ahmed, Apoorve Mohan, Gene Cooperman, Guillaume Pierre
2020 2020 8th IEEE International Conference on Mobile Cloud Computing, Services, and Engineering (MobileCloud)  
To speed up the application startup times we propose to snapshot the state of fully-deployed containers and restart future container instances from a pre-started application state.  ...  Depending on the application this startup process can be arbitrarily long.  ...  The most notable ones are BLCR (Berkeley Lab's Checkpoint/Restart) [16], CRIU (Checkpoint and Restart In Userspace) [26] and DMTCP (Distributed MultiThreaded CheckPointing) [4].  ... 
doi:10.1109/mobilecloud48802.2020.00016 dblp:conf/mobilecloud/0001MCP20 fatcat:oxmqs3akajhrzcptqzmz3wgthu

CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications

Hiroyuki Takizawa, Kentaro Koyama, Katsuto Sato, Kazuhiko Komatsu, Hiroaki Kobayashi
2011 2011 IEEE International Parallel & Distributed Processing Symposium  
This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs including MPI applications, and quantitatively evaluates the runtime overheads.  ...  In this paper, we propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing.  ...  of Tokyo Institute of Technology for meaningful discussions on programming issues for checkpointing GPU computing applications.  ... 
doi:10.1109/ipdps.2011.85 dblp:conf/ipps/TakizawaKSKK11 fatcat:jhscghbvrjbjrirribklz6cc2q

Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer's Perspective [chapter]

Sebastian Gerlach, Basile Schaeli, Roger D. Hersch
2006 Lecture Notes in Computer Science  
Dynamic Parallel Schedules (DPS) is a flow graph based framework for developing parallel applications on clusters of workstations.  ...  The current state of a failed node can be reconstructed on its backup threads by re-executing the application since the last checkpoint.  ...  Introduction and Related Work Clusters of commodity workstations are rapidly growing in size and complexity as computation power requirements increase.  ... 
doi:10.1007/11808107_9 fatcat:5ysxlk3r7vaidgude7dmmjkada

System-Level Scalable Checkpoint-Restart for Petascale Computing

Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda, Hari Subramoni, Jerome Vienne, Gene Cooperman
2016 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)  
Advances in transparent checkpointing on large-scale supercomputers depend on the fundamental problem of transparent checkpointing over InfiniBand: how to save or replay "in-flight data" that is present  ...  This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package.  ...  ACKNOWLEDGMENT We would like to acknowledge the comments and encouragement of Raghu Raja Chandrasekar in integrating DMTCP with MVAPICH2.  ... 
doi:10.1109/icpads.2016.0125 dblp:conf/icpads/CaoAGMPSVC16 fatcat:rdngzsu2cvgnjff5sgld5d5jau

System-level Scalable Checkpoint-Restart for Petascale Computing [article]

Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda, Hari Subramoni, Jérôme Vienne, Gene Cooperman
2016 arXiv   pre-print
This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package.  ...  Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing.  ...  Acknowledgment We would like to acknowledge the comments and encouragement of Raghu Raja Chandrasekar in integrating DMTCP with MVAPICH2.  ... 
arXiv:1607.07995v2 fatcat:cioihh6ecrg4dgtghtgv3enq6m