Resilient Datacenter Load Balancing in the Wild

Hong Zhang, Junxue Zhang, Wei Bai, Kai Chen, Mosharaf Chowdhury
2017 Proceedings of the Conference of the ACM Special Interest Group on Data Communication - SIGCOMM '17  
Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Therefore, datacenter load balancing schemes must be resilient to these uncertainties; i.e., they should accurately sense path conditions and timely react to mitigate the fallouts. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at fixed granularity. On the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can only reroute when flowlets emerge; thus, they cannot always react timely to uncertainties. To make things worse, these solutions fail to detect/handle failures such as blackholes and random packet drops, which greatly degrades their performance. In this paper, we introduce Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions including failures unattended before, and it reacts using timely yet cautious rerouting. Hermes is a practical edge-based solution with no switch modification. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves comparable performance to CONGA and Presto in normal cases, and well handles uncertainties: under asymmetries, Hermes achieves up to 10% and 20% better flow completion time (FCT) than CONGA and CLOVE; under switch failures, it outperforms all other schemes by over 32%.
doi:10.1145/3098822.3098841 dblp:conf/sigcomm/ZhangZB0C17 fatcat:3trdbnmxnfcl5abqlq6x7bc4sa