netCSI: A Generic Fault Diagnosis Algorithm for Large-Scale Failures in Computer Networks

Srikar Tati, Scott Rager, Bong Jun Ko, Guohong Cao, Ananthram Swami, Thomas La Porta
2016 IEEE Transactions on Dependable and Secure Computing  
In this paper we present a framework and a set of algorithms for determining faults in networks when large-scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes in the network. netCSI consists of two parts: a hypothesis generation algorithm and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of positive and negative symptoms to improve the precision of the results. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and achieve an average gain of 128% in accuracy for realistic topologies.
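
The general idea of combining both kinds of symptoms with a coverage-style search can be illustrated with a small sketch. This is not the netCSI algorithm from the paper: the function names, the path-based symptom model, and the mean-pairwise-distance score are all assumptions made for illustration. Observations of working paths clear the components they traverse, a greedy coverage step then explains the failed paths, and a simple spread score stands in for the paper's conditional failure probability models. Which observation type the paper calls "positive" or "negative" is not asserted here.

```python
from itertools import combinations
from math import dist


def candidate_components(failed_paths, working_paths):
    """Hypothetical pruning step: keep components that lie on some
    failed path and on no working path."""
    cleared = set().union(*working_paths)
    suspects = set().union(*failed_paths)
    return suspects - cleared


def greedy_hypothesis(failed_paths, candidates):
    """Greedy coverage: repeatedly pick the candidate that explains the
    most still-unexplained failed paths (ties broken alphabetically)."""
    unexplained = [set(p) for p in failed_paths]
    hypothesis = set()
    while unexplained and candidates:
        best = max(sorted(candidates),
                   key=lambda c: sum(c in p for p in unexplained))
        if not any(best in p for p in unexplained):
            break  # remaining failures cannot be explained by any candidate
        hypothesis.add(best)
        unexplained = [p for p in unexplained if best not in p]
    return hypothesis


def spread(hypothesis, location):
    """Assumed ranking proxy: mean pairwise distance between the
    components of a hypothesis (smaller means more clustered)."""
    pairs = list(combinations(sorted(hypothesis), 2))
    if not pairs:
        return 0.0
    return sum(dist(location[a], location[b]) for a, b in pairs) / len(pairs)


# Toy observations: each path is the list of components it traverses.
failed_paths = [["a", "b", "c"], ["d", "b", "e"]]
working_paths = [["a", "f"], ["d", "e"]]

candidates = candidate_components(failed_paths, working_paths)
hypothesis = greedy_hypothesis(failed_paths, candidates)
```

In this toy input, the working paths clear `a`, `d`, `e`, and `f`, so only `b` and `c` remain as suspects, and the greedy step selects `b` because it lies on both failed paths. Under the clustered-failure assumption, competing hypotheses with a smaller `spread` would be ranked higher.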
doi:10.1109/tdsc.2014.2369051 fatcat:427lcjjbtjb2nb4xfvubx5uaoq