Testing Distributed Programs Containing Racing Messages

R. B. Kilgore
1997 The Computer Journal
Finding errors in non-deterministic programs is complicated by the fact that a bug may be revealed only by a particular sequence of program activities. An erroneous program may run correctly hundreds or thousands of times, each time avoiding the failure-causing sequence. This problem is exacerbated in distributed systems, since race conditions on messages may not be under the direct control of the programmer. We describe how message delivery ordering can be controlled during execution. Our objective is to provide a practical yet powerful testing environment for distributed systems, using re-execution. Previous work in this area was limited to deterministically replaying the same execution. We focus instead on re-executing the program under a strictly different message ordering, so that latent bugs are more likely to reveal themselves during testing. We show that messages are grouped into waves, such that any two messages from different waves must always be received in the same order. We provide an algorithm that produces a re-execution that maximizes the number of reordered pairs of message delivery events. We prove a tight lower bound of k − 1 reordered pairs of messages, where k is the number of messages in a wave. We also provide an efficient on-line algorithm for detecting racing messages. Previous methods for detecting race conditions were either off-line or limited to detecting the races for a single process.
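To make the notion of racing messages concrete, the sketch below flags pairs of messages received by one process whose delivery order could have been different. It is not the paper's on-line algorithm; it assumes the common vector-clock formulation, in which two messages race when the receipt of the first does not causally precede the sending of the second. The `Delivery` record, the function names, and the three-process example trace are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Clock = Tuple[int, ...]


@dataclass(frozen=True)
class Delivery:
    """One message received by the observed process (hypothetical layout)."""
    msg_id: str
    send_clock: Clock  # vector clock of the send event
    recv_clock: Clock  # vector clock of the receive event


def happened_before(a: Clock, b: Clock) -> bool:
    """True if the event stamped a causally precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def racing_pairs(deliveries: List[Delivery]) -> List[Tuple[str, str]]:
    """Return (earlier, later) pairs, in receive order, that race.

    Assumption: a pair races when the receipt of the earlier message
    does not happen before the sending of the later one.
    """
    races = []
    for i, first in enumerate(deliveries):
        for second in deliveries[i + 1:]:
            if not happened_before(first.recv_clock, second.send_clock):
                races.append((first.msg_id, second.msg_id))
    return races


# Hypothetical trace for process P0 in a three-process system.
# m1 (from P1) and m2 (from P2) are sent independently; m3 is sent by P1
# only after P1 has heard back from P0, so it cannot overtake m1 or m2.
example = [
    Delivery("m1", send_clock=(0, 1, 0), recv_clock=(1, 1, 0)),
    Delivery("m2", send_clock=(0, 0, 1), recv_clock=(2, 1, 1)),
    Delivery("m3", send_clock=(3, 3, 1), recv_clock=(4, 3, 1)),
]

print(racing_pairs(example))
```

In this trace only m1 and m2 are reported, which matches the intuition behind waves: m3 is causally ordered after both and must always be received last. The pairwise check is quadratic in the number of deliveries and is chosen here for readability, not efficiency.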
doi:10.1093/comjnl/40.8.489 fatcat:cawf5c53ezf6tnuhcachmohpqq