Restoring consistent global states of distributed computations

Arthur P. Goldberg, Ajei Gopal, Andy Lowry, Rob Strom
1991 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging - PADD '91  
We present a mechanism for restoring any consistent global state of a distributed computation. This capability can form the basis of support for rollback and replay of computations, an activity we view as essential in a comprehensive environment for debugging distributed programs. Our mechanism records occasional state checkpoints and logs all messages communicated between processes. Our mechanism offers flexibility in the following ways: any consistent global state of the computation can be restored; execution can be replayed either exactly as it occurred initially or with user-controlled variations; there is no need to know a priori what states might be of interest. In addition, if checkpoints and logs are written to stable storage, our mechanism can be used to restore states of computations that cause the system to crash.
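The abstract's combination of occasional checkpoints and complete message logs can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes each process is deterministic and changes state only when a message is delivered; the names `LoggedProcess`, `step`, `restore`, and `is_consistent`, and the tuple layout used for messages, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class LoggedProcess:
    """Hypothetical deterministic process whose state changes only on delivery."""
    step: Callable[[int, int], int]          # (state, message) -> new state
    state: int = 0
    checkpoint_every: int = 3                # assumed checkpoint interval
    log: List[int] = field(default_factory=list)
    checkpoints: Dict[int, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Checkpoint 0 is the initial state, so any prefix can be rebuilt.
        self.checkpoints[0] = self.state

    def deliver(self, msg: int) -> None:
        # Log every message, then apply it; checkpoint only occasionally.
        self.log.append(msg)
        self.state = self.step(self.state, msg)
        if len(self.log) % self.checkpoint_every == 0:
            self.checkpoints[len(self.log)] = self.state

    def restore(self, n: int) -> int:
        """Rebuild the state after the first n deliveries."""
        if not 0 <= n <= len(self.log):
            raise ValueError("no such point in the recorded execution")
        base = max(k for k in self.checkpoints if k <= n)  # latest usable checkpoint
        state = self.checkpoints[base]
        for msg in self.log[base:n]:                       # replay logged messages
            state = self.step(state, msg)
        return state


# A message is (sender, send_position, receiver, receive_position), where
# positions count events on each process starting from 1 (an assumed encoding).
Message = Tuple[str, int, str, int]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A cut is consistent if no message is received inside it but sent outside it."""
    return all(
        send_pos <= cut[sender]
        for sender, send_pos, receiver, recv_pos in messages
        if recv_pos <= cut[receiver]
    )
```

In this sketch, any point of a process's recorded history can be rebuilt after the fact, which mirrors the claim that the states of interest need not be known in advance. A global state would be restored by choosing one such point per process and checking the resulting cut with `is_consistent`.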
doi:10.1145/122759.122772 dblp:conf/pdd/GoldbergGLS91 fatcat:7mopswmpzzbpjl45vpz7u34qaa