Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

Pierre Riteau, Adrien Lèbre, Christine Morin
2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate. This makes fault tolerance mechanisms, e.g. process checkpoint/restart, a required technology to effectively exploit clusters. Most process checkpoint/restart implementations handle only volatile states and do not take into account the persistent states of applications, which can lead to incoherent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be interconnected with a large number of file systems. To avoid the performance issues of a stable support relying on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show the negligible performance impact of our proposal.
doi:10.1109/ccgrid.2009.29 dblp:conf/ccgrid/RiteauLM09 fatcat:6lvjro6fizaghgsatk3n62evpe