Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms

B. Bhargava, S.-R. Lian, P.-J. Leu
[1990] Proceedings. Sixth International Conference on Data Engineering  
We have implemented two classes of distributed checkpointing and rollback-recovery algorithms and evaluated their performance in a real processing environment. One algorithm is based on the synchronous approach and the other on the asynchronous approach. The evaluation measures the overhead due to the time spent executing the algorithms, and their cost in terms of computational time and message traffic. We identify the components that make up the execution time of these algorithms and study how each of them contributes to the total execution time. These data are validated by quantitative analysis. One objective of this study is to compare the two approaches. This evaluation is useful to a system designer choosing the appropriate recovery algorithm for a given application and environment. We believe that our study is the first attempt to implement, evaluate, and compare concurrent checkpointing/recovery algorithms in distributed systems. The knowledge gained by our research can be applied to achieve efficient fault tolerance in distributed database systems, distributed operating systems, and multiprocess environments.
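The abstract does not describe either algorithm in detail. One general trade-off between the two approaches is that synchronous (coordinated) checkpointing pays message overhead at checkpoint time so that the latest committed checkpoints always form a consistent state, whereas asynchronous checkpointing lets processes checkpoint independently and defers the work to recovery, when a consistent "recovery line" must be computed. The Python sketch below shows that deferred step using textbook rollback propagation. It is a hedged illustration, not necessarily the algorithm the authors implemented. The event-position model and the names `Message` and `recovery_line` are hypothetical assumptions. A checkpoint at position `p` is taken to mean that the saved state includes local events `0..p-1`, and position 0 (the initial state) is assumed to exist for every process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical model: positions are indices into each process's local event history.
    sender: int
    send_pos: int    # index of the send event in the sender's history
    receiver: int
    recv_pos: int    # index of the receive event in the receiver's history


def recovery_line(checkpoints, current, messages, failed):
    """Rollback propagation for independently taken (asynchronous) checkpoints.

    checkpoints[i]: sorted checkpoint positions of process i (must include 0).
    current[i]:     number of events process i has executed so far.
    failed:         index of the crashed process, which loses its volatile state.
    Returns the position each process restarts from.
    """
    line = list(current)
    # The failed process can only restart from its latest checkpoint.
    line[failed] = checkpoints[failed][-1]
    changed = True
    while changed:
        changed = False
        for m in messages:
            received = m.recv_pos < line[m.receiver]  # receive is part of restored state
            sent = m.send_pos < line[m.sender]        # send is part of restored state
            if received and not sent:
                # Orphan message: roll the receiver back to its latest checkpoint
                # taken at or before the receive event. Position 0 guarantees one exists,
                # and each rollback strictly lowers a position, so the loop terminates.
                line[m.receiver] = max(p for p in checkpoints[m.receiver] if p <= m.recv_pos)
                changed = True
    return line


# Two processes whose messages interleave with their checkpoints.
msgs = [
    Message(sender=0, send_pos=1, receiver=1, recv_pos=1),
    Message(sender=1, send_pos=2, receiver=0, recv_pos=2),
    Message(sender=0, send_pos=3, receiver=1, recv_pos=3),
]
print(recovery_line([[0, 3], [0, 2]], [5, 4], msgs, failed=0))  # [0, 0]
```

In this example, the failure of process 0 forces both processes back to their initial states (the domino effect). This is the kind of recovery-time cost that the synchronous approach avoids by paying for coordination messages when each checkpoint is taken.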
doi:10.1109/icde.1990.113468 dblp:conf/icde/BhargavaLL90 fatcat:xglsb6b22fgdjnjfmf37kcfgxy