Real-time, concurrent checkpoint for parallel programs

K. Li, J. F. Naughton, J. S. Plank
1990 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming - PPOPP '90  
We have developed and implemented a checkpointing and restart algorithm for parallel programs running on commercial uniprocessors and shared-memory multiprocessors. The algorithm runs concurrently with the target program, interrupts the target program for small, fixed amounts of time, and is transparent to the checkpointed program and its compiler. The algorithm achieves its efficiency through a novel use of address translation hardware that allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.

Introduction

This paper presents a checkpointing and restart algorithm for parallel programs running on commercial uniprocessors and multiprocessors. The algorithm runs concurrently with the target program, interrupts the target program for small, fixed amounts of time (under 0.1 seconds in our implementation), and requires no changes to the target's code or its compiler. One use of a checkpointing algorithm is to allow long-running programs to be resumed after a crash without having to restart at the beginning of the computation.
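
The text above says only that address translation hardware lets the expensive part of a checkpoint overlap with program execution; it does not spell out the mechanism. As a rough illustration, the following Python sketch simulates one scheme of that kind, assuming copy-on-write page protection: the program is paused just long enough to mark every page write-protected, a background copier then saves pages one by one, and a write to a still-protected page saves that page first. The class `CheckpointedMemory`, its methods, and the tiny `PAGE_SIZE` are all hypothetical names chosen for this sketch. It is a pure software simulation, not the authors' implementation, and it uses no real memory-protection calls.

```python
PAGE_SIZE = 4  # deliberately tiny so the example is easy to follow (assumption)


class CheckpointedMemory:
    """Software simulation of copy-on-write checkpointing (hypothetical sketch)."""

    def __init__(self, data: bytes):
        # Split the address space into fixed-size pages.
        self.pages = [bytearray(data[i:i + PAGE_SIZE])
                      for i in range(0, len(data), PAGE_SIZE)]
        self.protected = set()   # pages not yet saved in the current checkpoint
        self.snapshot = {}       # page number -> saved contents

    def begin_checkpoint(self):
        # The only step that pauses the program: mark every page protected.
        self.snapshot = {}
        self.protected = set(range(len(self.pages)))

    def _save_page(self, n):
        # Save a page once, then lift its protection.
        if n in self.protected:
            self.snapshot[n] = bytes(self.pages[n])
            self.protected.discard(n)

    def write(self, addr, value):
        # Simulated write fault: a protected page is saved before it changes.
        n = addr // PAGE_SIZE
        self._save_page(n)
        self.pages[n][addr % PAGE_SIZE] = value

    def copier_step(self):
        # Background copier: save one remaining page; False when none are left.
        if not self.protected:
            return False
        self._save_page(min(self.protected))
        return True

    def checkpoint_image(self):
        # The image is complete only after every page has been saved.
        assert not self.protected, "checkpoint still in progress"
        return b"".join(self.snapshot[n] for n in range(len(self.pages)))


mem = CheckpointedMemory(b"abcdefgh")
mem.begin_checkpoint()
mem.write(5, ord("X"))        # the program keeps running during the checkpoint
while mem.copier_step():      # the copier finishes the remaining pages
    pass
image = mem.checkpoint_image()
current = b"".join(bytes(p) for p in mem.pages)
```

In this simulation the checkpoint image reflects memory as it was when `begin_checkpoint` was called, while the program's own view includes the later write. In a real system the protection and fault handling would be done by the memory-management hardware and the operating system rather than by explicit method calls.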
doi:10.1145/99163.99173 dblp:conf/ppopp/LiNP90 fatcat:ymskyg75qvet7ope7jgmqg44wi