Robust Network Tomography in the Presence of Failures

S. Tati, S. Silvestri, T. He, T. La Porta
2014 IEEE 34th International Conference on Distributed Computing Systems
In this paper, we study the problem of selecting paths to improve the performance of network tomography applications in the presence of network element failures. We model the robustness of paths in network tomography by a metric called expected rank. We formulate an optimization problem to cover two complementary performance metrics: robustness and probing cost. The problem aims at maximizing the expected rank under a budget constraint on the probing cost. We prove that the problem is NP-Hard.
Under the assumption that the failure distribution is known, we propose an algorithm called RoMe with a guaranteed approximation ratio. Moreover, since evaluating the expected rank is generally hard, we provide a bound which can be evaluated efficiently. We also consider the case in which the failure distribution is not known, and propose a reinforcement learning algorithm to solve our optimization problem, using RoMe as a subroutine. We run a wide range of simulations under realistic network topologies and link failure models to evaluate our solution against a state-of-the-art path selection algorithm. Results show that our approaches provide significant improvements in the performance of network tomography applications under failures.

I. INTRODUCTION

In the Internet and complex wide-area networks, network management involves a wide range of tasks such as fault detection, performance diagnosis, resource allocation, route selection and congestion control. Most of these tasks require complete knowledge of the internal network state and network topology. Network tomography techniques [1], [2], [3] have been proposed to acquire this information efficiently by probing only end-to-end (e2e) paths from monitors located at the network edges, instead of directly monitoring every network element. Applications of network tomography include, but are not limited to, inference of individual link performance metrics from given e2e path measurements [1], network topology inference [2], and estimation of the complete set of e2e measurements from an incomplete set [3].

A commonly adopted approach in network tomography is to formulate a linear system that models the relationship between path measurements and individual link metrics. Given the candidate paths between monitors, state-of-the-art solutions in network tomography select a subset of these paths, determined by finding an arbitrary basis of the linear system.
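The linear-system view and the expected rank metric can be illustrated with a small sketch. Everything here is an illustrative assumption rather than the authors' implementation: the three-link triangle topology, the helper names `routing_matrix`, `matrix_rank` and `expected_rank`, the independent per-link failure probabilities, and the reading of expected rank as the expectation, over failure scenarios, of the rank of the rows belonging to paths that survive. The exact enumeration is exponential in the number of links, which is consistent with the remark that evaluating the expected rank is generally hard.

```python
from fractions import Fraction
from itertools import product


def routing_matrix(paths, num_links):
    """Row i has a 1 in column j when path i traverses link j."""
    return [[1 if j in links else 0 for j in range(num_links)] for links in paths]


def matrix_rank(rows):
    """Rank by Gaussian elimination over exact fractions (small matrices only)."""
    m = [[Fraction(x) for x in row] for row in rows]
    num_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(num_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def expected_rank(paths, fail_prob):
    """Exact expected rank, assuming links fail independently (hypothetical model)."""
    num_links = len(fail_prob)
    R = routing_matrix(paths, num_links)
    total = 0.0
    # Enumerate every up/down combination of links: 2**num_links scenarios.
    for up in product([True, False], repeat=num_links):
        prob = 1.0
        for j in range(num_links):
            prob *= (1.0 - fail_prob[j]) if up[j] else fail_prob[j]
        # A path can be probed only if every link it traverses is up.
        surviving = [R[i] for i, links in enumerate(paths) if all(up[j] for j in links)]
        total += prob * matrix_rank(surviving)
    return total


# Hypothetical example: three links, three two-link paths forming a basis.
paths = [{0, 1}, {1, 2}, {0, 2}]
full_rank = matrix_rank(routing_matrix(paths, 3))
er = expected_rank(paths, [0.1, 0.1, 0.1])
print(full_rank, er)
```

In this toy case the three paths have full rank 3 without failures, but any single link failure removes two of the three paths, so the expected rank drops to about 2.43 when each link fails with probability 0.1. A basis chosen without regard to failures can therefore lose much of its rank.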
By probing the paths in a basis, previous approaches [1], [4], [3] reduce the overhead of collecting e2e measurements while maximizing the performance. Existing work assumes a simple network model, where all network elements are reliable. However, failures of network elements are common events in modern networks due to maintenance procedures, hardware malfunctions, energy outages, or disasters [5]. The typical duration of link failures in IP networks [5] is longer than the time windows for measurement collection [6] in network tomography. Hence, link failures may prevent the collection of some measurements, which degrades the performance of network tomography applications. As a result, previous approaches may perform poorly in the presence of failures.

... √e. We note that an exact implementation of RoMe would be hindered by the high complexity of evaluating the expected rank (ER). To address this issue, we derive an analytical bound on ER that can be evaluated efficiently. Moreover, we show that our solution becomes optimal in the more constrained case of selecting only linearly independent paths. In the second scenario, we assume no prior knowledge on

This work is published in the IEEE International Conference on Distributed Computing Systems (ICDCS), 2014, Madrid, Spain.
doi:10.1109/icdcs.2014.56 dblp:conf/icdcs/TatiSHP14 fatcat:4sa5pidhizdl5ocy3nvzwmgkra