Diagnosing missing events in distributed systems with negative provenance

Yang Wu, Mingchen Zhao, Andreas Haeberlen, Wenchao Zhou, Boon Thau Loo
2014 Proceedings of the 2014 ACM conference on SIGCOMM - SIGCOMM '14  
When debugging a distributed system, it is sometimes necessary to explain the absence of an event, for instance, why a certain route is not available or why a certain packet did not arrive. Existing debuggers offer some support for explaining the presence of events, usually by providing the equivalent of a backtrace in conventional debuggers, but they are not very good at answering "Why not?" questions: there is simply no starting point for a possible backtrace. In this paper, we show that the concept of negative provenance can be used to explain the absence of events in distributed systems. Negative provenance relies on counterfactual reasoning to identify the conditions under which the missing event could have occurred. We define a formal model of negative provenance for distributed systems, and we present the design of a system called Y! that tracks both positive and negative provenance and can use them to answer diagnostic queries. We describe how we have used Y! to debug several realistic problems in two application domains: software-defined networks and BGP interdomain routing. Results from our experimental evaluation show that the overhead of Y! is moderate.
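
The counterfactual idea in the abstract can be illustrated with a toy, single-node sketch. This is not the Y! algorithm or its data model: the `Rule` type, the fact names, and the functions `derive`, `why_not`, and `root_causes` are all hypothetical, and the sketch ignores time, distribution, and message loss. It only shows that a "Why not?" query can start from the rules that could have produced the missing event and work back to the preconditions that never held.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical derivation rule: `head` holds once every fact in `body` holds."""
    head: str
    body: tuple


def derive(rules, base_facts):
    """Compute every fact that holds, by applying the rules until nothing changes."""
    known = set(base_facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.head not in known and all(b in known for b in rule.body):
                known.add(rule.head)
                changed = True
    return known


def why_not(fact, rules, known, seen=frozenset()):
    """Return a tree explaining why `fact` is absent, or None if it is present."""
    if fact in known:
        return None
    node = {"missing": fact, "via": []}
    if fact in seen:
        # Cyclic rules: stop here and treat the fact as a leaf.
        return node
    for rule in rules:
        if rule.head != fact:
            continue
        # Counterfactual step: this rule would have fired if these had existed.
        blocked = [
            why_not(b, rules, known, seen | {fact})
            for b in rule.body
            if b not in known
        ]
        node["via"].append({"rule": rule, "blocked_by": blocked})
    return node


def root_causes(node):
    """Collect the leaves of an explanation tree: facts that no rule could explain further."""
    if not node["via"]:
        return {node["missing"]}
    causes = set()
    for alternative in node["via"]:
        for child in alternative["blocked_by"]:
            causes |= root_causes(child)
    return causes


# Illustrative scenario (assumed, not taken from the paper's case studies).
RULES = [
    Rule("flow_entry", ("policy_allows", "controller_up")),
    Rule("packet_delivered", ("packet_sent", "flow_entry")),
]
BASE_FACTS = {"packet_sent", "controller_up"}

known = derive(RULES, BASE_FACTS)
explanation = why_not("packet_delivered", RULES, known)
print(sorted(root_causes(explanation)))  # ['policy_allows']
```

In this scenario the packet is not delivered because no flow entry exists, and no flow entry exists because the policy fact was never present. A positive backtrace has nothing to start from here, whereas the negative query starts from the rule whose head is the missing event.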
doi:10.1145/2619239.2626335 dblp:conf/sigcomm/WuZHZL14 fatcat:ojkwu6xxyjgfjj2zkxbqhqccye