Replication is more efficient than you think

2019 Zenodo  
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing ... checkpointing period \(P_{Daly} = \sqrt{2 \mu C}\) à la Young/Daly, where \(C\) is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy, in which failed processors are restarted after each checkpoint; this may add overhead during checkpoints but prevents the application configuration from degrading over successive checkpointing periods. We show how to compute the optimal checkpointing period for this restart strategy and prove that its length is an order of magnitude higher than \(P_{Daly}\). We show through simulations that using the appropriate period with the restart strategy, instead of \(P_{Daly}\) with the usual no-restart strategy, significantly decreases the overhead induced by replication.
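For a sense of scale, the Young/Daly formula quoted in the abstract can be evaluated directly. The sketch below assumes \(\mu\) is a mean time between interruptions and \(C\) is the checkpoint duration, both in seconds. The function name and the sample values are illustrative and do not come from the paper. It covers only \(P_{Daly}\), not the optimal period for the restart strategy, which the abstract does not state.

```python
import math


def young_daly_period(mu: float, checkpoint_cost: float) -> float:
    """Return sqrt(2 * mu * C), the first-order Young/Daly checkpointing period.

    mu: assumed mean time between interruptions, in seconds.
    checkpoint_cost: checkpoint duration C, in seconds.
    """
    if mu <= 0 or checkpoint_cost <= 0:
        raise ValueError("mu and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mu * checkpoint_cost)


# Hypothetical values, chosen only for illustration: mu = 20,000 s, C = 100 s.
period = young_daly_period(20_000.0, 100.0)
print(period)  # 2000.0
```

Because the period grows with the square root of \(\mu\), quadrupling \(\mu\) only doubles the period.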
doi:10.5281/zenodo.2633271 fatcat:xyxtb4qv7jebrg6fxjnyn2byui