Hybrid Electrical/Optical Switch Architectures for Training Distributed Deep Learning in Large-Scale Clusters

Thao-Nguyen TRUONG, Ryousei TAKANO
IEICE Transactions on Information and Systems, 2021
SUMMARY Data parallelism is the dominant method used to train deep learning (DL) models on High-Performance Computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes a bottleneck because of its higher latency and lower link bandwidth compared with intra-node communication. Although several communication techniques have been proposed to cope with this problem, all of them target the large message sizes while working around the limitations of the inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using a hybrid switching system, i.e., Electrical Packet Switching combined with Optical Circuit Switching. We found that the typical data transfer of synchronous data-parallel training is long-lived and rarely changes, and hence can be sped up with optical switching. Simulation results on the SimGrid simulator show that our approach shortens the training time of deep learning applications, especially at large scale.
key words: distributed deep learning, high performance computing (HPC), optical circuit switching, hybrid switching
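To make the communication pattern concrete, the following is a minimal sketch (not taken from the paper) of the gradient synchronization step in synchronous data-parallel training, written with mpi4py and NumPy; the gradient size and iteration count are hypothetical placeholders. Each iteration exchanges an identically sized gradient buffer among the same set of workers, which is the long-lived, rarely changing traffic that the paper argues can be routed over optical circuits.

    # Sketch: synchronous data-parallel gradient averaging via MPI Allreduce.
    # Assumes mpi4py and NumPy; buffer size and step count are illustrative only.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    world = comm.Get_size()

    grad_size = 25_000_000                      # hypothetical ~100 MB fp32 gradient
    local_grad = np.random.rand(grad_size).astype(np.float32)
    global_grad = np.empty_like(local_grad)

    for step in range(100):                     # every step repeats the same transfer
        # ... forward/backward pass would refresh local_grad here ...
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
        global_grad /= world                    # average gradients across workers

Because the message size and the communicating peers stay fixed across iterations, the circuit-setup cost of optical switching can be amortized over many training steps, which is the intuition behind the hybrid electrical/optical design studied in the paper.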
doi:10.1587/transinf.2020edp7201