
Local SGD: Unified Theory and New Efficient Methods [article]

Eduard Gorbunov, Filip Hanzely, Peter Richtárik
2020 arXiv   pre-print
We present a unified framework for analyzing local SGD methods in the convex and strongly convex regimes for distributed/federated training of supervised machine learning models.  ...  We recover several known methods as a special case of our general framework, including Local-SGD/FedAvg, SCAFFOLD, and several variants of SGD not originally designed for federated learning.  ...  Gorbunov was also partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03 and RFBR, project number 19-31-51001.  ... 
arXiv:2011.02828v1
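The local-update pattern shared by the methods this framework unifies can be sketched in a few lines; the quadratic objectives, worker count, and step sizes below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy sketch of Local SGD / FedAvg: M workers each run K local SGD steps on
# their own quadratic f_m(x) = 0.5 * (x - a_m)^2, then average their models
# (one communication round). All constants here are illustrative.
rng = np.random.default_rng(0)
M, K, R, lr = 4, 10, 50, 0.1           # workers, local steps, rounds, step size
targets = rng.normal(size=M)           # per-worker optima (heterogeneous data)
x = np.zeros(M)                        # worker m's local model is x[m] (scalar)

for _ in range(R):                     # communication rounds
    for _ in range(K):                 # local steps with noisy gradients
        grads = (x - targets) + 0.01 * rng.normal(size=M)
        x -= lr * grads
    x[:] = x.mean()                    # synchronize: average the local models
```

On this toy problem the averaged model lands near the mean of the per-worker optima; with identical data the local steps simply save communication.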

Is Local SGD Better than Minibatch SGD? [article]

Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro
2020 arXiv   pre-print
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method.  ...  we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least  ...  Acknowledgements This work is partially supported by NSF-CCF/BSF award 1718970/2016741, NSF-DMS 1547396, and a Google Faculty Research Award. BW is supported by a Google PhD Fellowship.  ... 
arXiv:2002.07839v2
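The minibatch-SGD baseline that local SGD is compared against can be sketched as follows, on a toy one-dimensional quadratic with identical data across workers (all constants are illustrative):

```python
import numpy as np

# Minibatch SGD: every iteration, all M workers compute one stochastic
# gradient at the shared iterate and the average is applied as a single step,
# so gradient variance is cut by a factor of M (one communication per step).
rng = np.random.default_rng(1)
M, T, lr = 4, 500, 0.1                 # workers, iterations, step size
a_star = 2.0                           # common optimum (identical data)
x = 0.0
for _ in range(T):
    g = (x - a_star) + rng.normal(size=M)   # M noisy gradients of 0.5*(x-a)^2
    x -= lr * g.mean()
```

Local SGD spends the same gradient budget on many small steps between communications instead of one lower-variance step per communication, which is exactly the trade-off the paper analyzes.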

Tighter Theory for Local SGD on Identical and Heterogeneous Data [article]

Ahmed Khaled and Konstantin Mishchenko and Peter Richtárik
2022 arXiv   pre-print
Our bounds are based on a new notion of variance that is specific to local SGD methods with different data.  ...  We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous.  ...  To properly discuss the efficiency of local SGD, we also need a practical way of quantifying it.  ... 
arXiv:1909.04746v4

Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms

Jianyu Wang, Gauri Joshi
2021 Journal of Machine Learning Research  
and provides a unified convergence analysis.  ...  and 2) improvements upon previous analyses of local SGD and decentralized parallel SGD.  ...  The experiments were conducted on the ORCA cluster provided by the Parallel Data Lab at CMU, and on Amazon AWS (supported by an AWS credit grant).  ... 
dblp:journals/jmlr/WangJ21

Statistical Estimation and Inference via Local SGD in Federated Learning [article]

Xiang Li, Jiadong Liang, Xiangyu Chang, Zhihua Zhang
2021 arXiv   pre-print
Our theoretical and empirical results show that Local SGD simultaneously achieves both statistical efficiency and communication efficiency.  ...  Both the methods are communication efficient and applicable to online data.  ...  Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates.  ... 
arXiv:2109.01326v2

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates [article]

Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, Sebastian U. Stich
2021 arXiv   pre-print
cooperative SGD and federated averaging (local SGD).  ...  Decentralized stochastic optimization methods have gained a lot of attention recently, mainly because of their cheap per-iteration cost, data locality, and communication efficiency.  ...  Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.  ... 
arXiv:2003.10422v3
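A decentralized SGD step of the kind this unified analysis covers can be sketched with a gossip (mixing) matrix; the ring topology, objectives, and constants below are illustrative, not the paper's experiments.

```python
import numpy as np

# Toy decentralized SGD: each of 4 nodes on a ring takes a local gradient
# step, then averages with its neighbors via a doubly stochastic mixing
# matrix W (rows and columns sum to 1).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
rng = np.random.default_rng(4)
targets = rng.normal(size=4)           # node m minimizes 0.5 * (x - a_m)^2
x, lr = np.zeros(4), 0.1
for _ in range(400):
    g = (x - targets) + 0.01 * rng.normal(size=4)
    x = W @ (x - lr * g)               # local step, then gossip averaging
```

Because W is doubly stochastic, the network average evolves like centralized SGD and converges toward the mean of the per-node optima; the spectral gap of W controls how fast the nodes reach consensus.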

Communication-efficient Decentralized Local SGD over Undirected Networks [article]

Tiancheng Qin, S. Rasoul Etesami, César A. Uribe
2020 arXiv   pre-print
We study the Decentralized Local SGD method, where agents perform a number of local gradient steps and occasionally exchange information with their neighbors.  ...  Agents have access to F through noisy gradients, and they can locally communicate with their neighbors over a network.  ...  In Koloskova et al. (2020), the authors introduced a unifying theory for decentralized SGD and local updates.  ... 
arXiv:2011.03255v1

Local SGD With a Communication Overhead Depending Only on the Number of Workers [article]

Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis
2020 arXiv   pre-print
In this paper, we give a new analysis of Local SGD.  ...  The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications.  ...  A unified theory of decentralized sgd with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020. [KMR20] A Khaled, K Mishchenko, and P Richtárik.  ... 
arXiv:2006.02582v1

Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms [article]

Jianyu Wang, Gauri Joshi
2019 arXiv   pre-print
Communication-efficient SGD algorithms, which allow nodes to perform local updates and periodically synchronize local models, are highly effective in improving the speed and scalability of distributed  ...  This paper presents a unified framework called Cooperative SGD that subsumes existing communication-efficient SGD algorithms such as periodic-averaging, elastic-averaging and decentralized SGD.  ...  Acknowledgments The authors thank Anit Kumar Sahu for his suggestions and feedback. This work was partially supported by the CMU Dean's fellowship and an IBM Faculty Award.  ... 
arXiv:1808.07576v3

Linearly Converging Error Compensated SGD [article]

Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, Peter Richtárik
2020 arXiv   pre-print
In this paper, we propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates.  ...  Moreover, for the case when each worker's loss function has a finite-sum form, we modify the method to obtain a new one, EC-LSVRG-DIANA, which is the first distributed stochastic method with  ...  Acknowledgments and Disclosure of Funding The work of Peter Richtárik, Eduard Gorbunov and Dmitry Kovalev was supported by KAUST Baseline Research Fund. Part of this work was done while E.  ... 
arXiv:2010.12292v1

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices [article]

Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko
2022 arXiv   pre-print
We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees.  ...  In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth.  ...  Acknowledgements We would like to thank Anastasia Koloskova, Liudmila Prokhorenkova and Anton Osokin for helpful feedback and discussions.  ... 
arXiv:2103.03239v4

Tackling Data Heterogeneity: A New Unified Framework for Decentralized SGD with Sample-induced Topology [article]

Yan Huang, Ying Sun, Zehan Zhu, Changzhi Yan, Jinming Xu
2022 arXiv   pre-print
By properly designing the topology of the augmented graph, we are able to recover as special cases the renowned Local-SGD and DSGD algorithms, and provide a unified perspective for variance-reduction (VR) and gradient-tracking (GT) methods such as SAGA, Local-SVRG and GT-SAGA.  ...  New efficient algorithms.  ... 
arXiv:2207.03730v1

SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation [article]

Robert M. Gower, Othmane Sebbouh, Nicolas Loizou
2021 arXiv   pre-print
Yet, the standard convergence theory for SGD in the smooth non-convex setting gives a slow sublinear convergence to a stationary point.  ...  We provide theoretical guarantees for the convergence of SGD for different step-size selections including constant, decreasing and the recently proposed stochastic Polyak step-size.  ...  A unified theory of decentralized SGD with changing topology and local updates. ICML.Lee, J. C. H. and Valiant, P. (2016). Optimizing starconvex functions. In FOCS.  ... 
arXiv:2006.10311v3

Global Momentum Compression for Sparse Communication in Distributed SGD [article]

Shen-Yi Zhao, Yin-Peng Xie, Hao Gao, Wu-Jun Li
2019 arXiv   pre-print
GMC also combines memory gradient and momentum SGD, but unlike DGC, which adopts local momentum, GMC adopts global momentum.  ...  Recently, a method called deep gradient compression (DGC) has been proposed to combine memory gradient and momentum SGD for sparse communication.  ...  One efficient way to solve (1) is stochastic gradient descent (SGD) (Robbins and Monro, 1951).  ... 
arXiv:1905.12948v2
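The memory-gradient mechanism that DGC and GMC both build on can be sketched as top-k sparsification with an error-feedback buffer; this toy is not either paper's exact algorithm, and all names and constants are illustrative.

```python
import numpy as np

# Sparse communication with a memory (error-feedback) buffer: only the k
# largest-magnitude coordinates of the accumulated update are transmitted;
# the remainder is stored and re-injected into later updates.
def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(2)
d, k, lr, T = 20, 5, 0.1, 400
a_star = rng.normal(size=d)            # optimum of f(x) = 0.5 * ||x - a||^2
x, memory = np.zeros(d), np.zeros(d)
for _ in range(T):
    g = (x - a_star) + 0.01 * rng.normal(size=d)
    update = top_k(memory + lr * g, k) # transmit only k coordinates
    memory += lr * g - update          # remember what was not transmitted
    x -= update
```

The memory buffer is what lets aggressive sparsification converge at all: no coordinate is dropped permanently, only delayed.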

VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning [article]

Fanhua Shang, Kaiwen Zhou, Hongying Liu, James Cheng, Ivor W. Tsang, Lijun Zhang, Dacheng Tao, Licheng Jiao
2018 arXiv   pre-print
Experimental results show that VR-SGD converges significantly faster than SVRG and Prox-SVRG, and usually outperforms state-of-the-art accelerated methods, e.g., Katyusha.  ...  Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively.  ...  This means that our VR-SGD method can use much larger learning rates than SVRG both in theory and in practice.  ... 
arXiv:1802.09932v2
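The variance-reduced gradient step common to SVRG, Prox-SVRG, and VR-SGD can be sketched as follows; the snapshot and starting-point choices that distinguish VR-SGD are not reproduced here, and the least-squares problem and constants are toy assumptions.

```python
import numpy as np

# SVRG-style variance reduction: keep a snapshot and its full gradient, and
# correct each stochastic gradient so its variance vanishes near the optimum.
rng = np.random.default_rng(3)
n, d = 50, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)                 # least squares: f_i(x)=0.5*(a_i@x-b_i)^2

def grad_i(x, i):                      # stochastic gradient of one component
    return A[i] * (A[i] @ x - b[i])

x, lr = np.zeros(d), 0.02
for _ in range(30):                    # outer epochs
    snapshot = x.copy()
    mu = A.T @ (A @ snapshot - b) / n  # full gradient at the snapshot
    for _ in range(n):                 # inner stochastic steps
        i = rng.integers(n)
        # unbiased estimate: grad_i(x) - grad_i(snapshot) + full gradient
        v = grad_i(x, i) - grad_i(snapshot, i) + mu
        x -= lr * v
```

Unlike plain SGD with a constant step size, this estimator has no residual noise floor on strongly convex finite sums, which is why larger learning rates remain stable.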