A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
In this work we propose DC-S3GD, a decentralized (without Parameter Server) stale-synchronous version of the Delay-Compensated Asynchronous Stochastic Gradient Descent (DC-ASGD) algorithm. ... Data parallelism has become the de facto standard for training Deep Neural Networks on multiple processing units. ... In recent years, large-scale training has been achieved with different flavors of the classic synchronous scheme, Synchronous SGD, in conjunction with decentralized communication. ...
arXiv:1911.02516v1 fatcat:uev2oh4qjref7kezoxj4bsepxa
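The delay-compensation idea behind DC-ASGD approximates the gradient at the current weights from a stale gradient via a first-order correction, with the Hessian approximated by the elementwise square of the gradient. A minimal toy sketch, assuming a simple quadratic objective; the function name, constants, and λ value are illustrative choices, not taken from the paper:

```python
import numpy as np

def dc_sgd(steps=200, lr=0.1, lam=0.05, delay=4):
    """Delay-compensated SGD sketch on f(w) = 0.5 * ||w||^2 (toy objective)."""
    w = np.array([5.0, -3.0])           # current parameters
    snapshots = [w.copy()] * delay      # stale copies a worker computed on
    for _ in range(steps):
        w_stale = snapshots.pop(0)      # gradient is `delay` steps old
        g = w_stale                     # grad of 0.5*||w||^2 is w itself
        # First-order compensation: g + lam * (g ⊙ g) ⊙ (w - w_stale),
        # where lam * g ⊙ g stands in for the diagonal of the Hessian.
        g_comp = g + lam * g * g * (w - w_stale)
        w = w - lr * g_comp
        snapshots.append(w.copy())
    return w

print(dc_sgd())  # approaches the optimum [0, 0] despite stale gradients
```

On this quadratic the true Hessian is the identity, so the `lam * g * g` term is only a rough surrogate; the point of the sketch is the structure of the correction, not the constants.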
Neural network training is commonly accelerated by using multiple synchronized workers to compute gradient updates in parallel. ... We show that applying AB on top of SGD with momentum enables training ResNets on CIFAR-10 and ImageNet-1k with delays D ≥ 32 update steps with minimal drop in final test accuracy. ... Acknowledgements: We thank Joel Hestness, Vithursan Thangarasa, and Xin Wang for their help and feedback that improved the manuscript. ...
arXiv:2007.01397v2 fatcat:wglfxomtn5hjpm67zjlkfa2m7q
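The gradient-delay setting this abstract addresses can be simulated in a few lines: a worker applies gradients computed on weights that are D update steps old. This sketch only models stale gradients under SGD with momentum on a toy quadratic; the AB rule itself is not reproduced, and all names and constants are illustrative:

```python
import numpy as np

def delayed_momentum_sgd(delay, steps=200, lr=0.05, beta=0.5):
    """Return final ||w|| after SGD+momentum with gradients `delay` steps stale."""
    w = np.array([4.0, -2.0])            # toy quadratic f(w) = 0.5 * ||w||^2
    m = np.zeros_like(w)                 # momentum buffer
    history = [w.copy()] * (delay + 1)   # delay=0 recovers synchronous SGD
    for _ in range(steps):
        g = history.pop(0)               # grad of 0.5*||w||^2 at stale weights
        m = beta * m + g
        w = w - lr * m
        history.append(w.copy())
    return np.linalg.norm(w)

# Stale gradients slow convergence relative to the synchronous baseline.
print(delayed_momentum_sgd(0), delayed_momentum_sgd(8))
```

Larger delays eventually destabilize plain momentum SGD on this problem, which is the failure mode a delay-mitigation rule like AB is meant to address.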