A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit the original URL.
The file type is application/pdf.
Hybrid Electrical/Optical Switch Architectures for Training Distributed Deep Learning in Large-Scale Neural Networks
2021
IEICE Transactions on Information and Systems
Thao-Nguyen TRUONG, Nonmember and Ryousei TAKANO, Member

SUMMARY Data parallelism is the dominant method used to train deep learning (DL) models on High-Performance Computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes a bottleneck because of its higher latency and lower link bandwidth relative to intra-node communication. Although some communication techniques have been proposed to cope with this
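As context for the bottleneck the abstract describes, the following is a minimal sketch of the data-parallel pattern it refers to, assuming PyTorch and its torch.distributed package (an assumption; the paper itself does not prescribe a framework, and the function and variable names here are illustrative). Each worker computes gradients on its own mini-batch shard and then all-reduces them; that collective exchange is the inter-node traffic that dominates at large scale.

# Minimal sketch, not the system described in the paper.
# Assumes dist.init_process_group(...) has already been called on every worker.
import torch
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    """One data-parallel step: local backward pass, then gradient averaging."""
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()  # each worker computes gradients on its local shard

    # Gradient synchronization across workers: this collective is the
    # inter-node communication that becomes the bottleneck at large scale.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # average so the update matches single-node SGD

    optimizer.step()
    return loss.item()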
doi:10.1587/transinf.2020edp7201
fatcat:jljuf3xivnc2dimnw3dwl7xvpe