Aspen trees

Meg Walraed-Sullivan, Amin Vahdat, Keith Marzullo
2013 Proceedings of the ninth ACM conference on Emerging networking experiments and technologies - CoNEXT '13  
Fault recovery is a key issue in modern data centers. In a fat tree topology, a single link failure can disconnect a set of end hosts from the rest of the network until updated routing information is disseminated to every switch in the topology. The time for re-convergence can be substantial, leaving hosts disconnected for long periods of time and significantly reducing the overall availability of the data center. Moreover, the message overhead of sending updated routing information to the entire topology may be unacceptable at scale. We present techniques to modify hierarchical data center topologies to enable switches to react to failures locally, thus reducing both the convergence time and control overhead of failure recovery. We find that for a given network size, decreasing a topology's convergence time results in a proportional decrease to its scalability (e.g., the number of hosts supported). On the other hand, reducing convergence time without affecting scalability necessitates the introduction of additional switches and links. We explore the tradeoffs between fault tolerance, scalability and network size, and propose a range of modified multi-rooted tree topologies that provide significantly reduced convergence time while retaining most of the traditional fat tree's desirable properties.
doi:10.1145/2535372.2535383 dblp:conf/conext/Walraed-SullivanVM13 fatcat:33esu4elzjguhgwfhvcxsltig4