3,892 Hits in 2.8 sec

Is Local SGD Better than Minibatch SGD? [article]

Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro
2020 arXiv   pre-print
sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD  ...  we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least  ...  Acknowledgements This work is partially supported by NSF-CCF/BSF award 1718970/2016741, NSF-DMS 1547396, and a Google Faculty Research Award. BW is supported by a Google PhD Fellowship.  ... 
arXiv:2002.07839v2 fatcat:5sgn5ondgjaetawz3ij74e4qn4
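The local-SGD-versus-minibatch-SGD comparison studied above can be made concrete with a toy sketch (not taken from the paper; the objective, step sizes, and function names are illustrative): minibatch SGD evaluates all gradients at the current synchronized iterate, while local SGD lets each worker take several sequential steps before averaging.

```python
import random

def grad(theta, x):
    # Per-sample gradient of the toy objective f(theta) = E[(theta - x)^2 / 2]
    return theta - x

def minibatch_sgd(samples, lr, batch_size, rounds, theta=0.0):
    # One synchronized step per round: all `batch_size` gradients
    # are evaluated at the same current iterate.
    for _ in range(rounds):
        batch = [random.choice(samples) for _ in range(batch_size)]
        theta -= lr * sum(grad(theta, x) for x in batch) / batch_size
    return theta

def local_sgd(samples, lr, workers, local_steps, rounds, theta=0.0):
    # Each worker runs `local_steps` sequential SGD steps from the shared
    # iterate, then the worker iterates are averaged (one communication round).
    for _ in range(rounds):
        iterates = []
        for _ in range(workers):
            th = theta
            for _ in range(local_steps):
                th -= lr * grad(th, random.choice(samples))
            iterates.append(th)
        theta = sum(iterates) / workers
    return theta
```

With the same per-round gradient budget (workers × local steps vs. batch size), the two methods trade off gradient freshness against communication, which is exactly the tension the upper and lower bounds above quantify.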

Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays [article]

Konstantin Mishchenko, Francis Bach, Mathieu Even, Blake Woodworth
2022 arXiv   pre-print
Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider.  ...  The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay.  ...  better than the Minibatch SGD algorithm described earlier.  ... 
arXiv:2206.07638v1 fatcat:efursasonjcm5oe2jum2ptbkza
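The delay sensitivity discussed above can be illustrated with a minimal simulation (a hypothetical sketch, not the paper's algorithm): each applied gradient was computed at a stale iterate from `delay` steps earlier, mimicking an asynchronous worker whose update arrives late.

```python
def async_sgd(grad_fn, lr, steps, delay, theta0=0.0):
    # Gradient descent where the gradient applied at step t was computed
    # at the iterate from step t - delay (the iterate the worker read).
    history = [theta0]
    for t in range(steps):
        stale = history[max(0, t - delay)]
        history.append(history[-1] - lr * grad_fn(stale))
    return history[-1]
```

For a fixed learning rate, larger delays slow and eventually destabilize convergence, which is why delay-adaptive step sizes (as in the analyses above) matter.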

Minibatch vs Local SGD for Heterogeneous Distributed Learning [article]

Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro
2022 arXiv   pre-print
We argue that, (i) Minibatch SGD (even without acceleration) dominates all existing analysis of Local SGD in this setting, (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high, and  ...  (iii) present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.  ...  Acknowledgements This work is partially supported by NSF/BSF award 1718970, NSF-DMS 1547396, and a Google Faculty Research Award. BW is supported by a Google PhD Fellowship.  ... 
arXiv:2006.04735v5 fatcat:dnjxqfzierhhhehjlu4cbeatpi

Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning [article]

Tomoya Murata, Taiji Suzuki
2021 arXiv   pre-print
Theoretically, under small heterogeneity of local objectives, we show that BVR-L-SGD achieves better communication complexity than both the previous non-local and local methods under mild conditions, and  ...  However, the superiority of local SGD to minibatch SGD only holds in quite limited situations.  ...  The communication complexity of BVR-L-SGD has a better dependence on ε than minibatch SGD, local SGD and SCAFFOLD.  ... 
arXiv:2102.03198v2 fatcat:4njzp2tgmfcy7eawiy3lpjgq4q

Strength of Minibatch Noise in SGD [article]

Liu Ziyin, Kangqiao Liu, Takashi Mori, Masahito Ueda
2022 arXiv   pre-print
The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning.  ...  This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum.  ...  Ziyin is supported by the GSS Scholarship of The University of Tokyo. Kangqiao Liu was supported by the GSGC program of the University of Tokyo.  ... 
arXiv:2102.05375v3 fatcat:kmxmaqi6rncjnono5xadthvy4q

Trade-offs of Local SGD at Scale: An Empirical Study [article]

Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos, Nicolas Ballas
2021 arXiv   pre-print
This finding is in contrast to the smaller-scale experiments in prior work, suggesting that local SGD encounters challenges at scale.  ...  One strategy for reducing this overhead is to perform multiple unsynchronized SGD steps independently on each worker between synchronization steps, a technique known as local SGD.  ...  We perform a comprehensive empirical study on ImageNet that identifies previously unreported scalability limitations of local and post-local SGD.  ... 
arXiv:2110.08133v1 fatcat:bsukuvzllvcxvopeuk6ghspt6e

SGD with a Constant Large Learning Rate Can Converge to Local Maxima [article]

Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda
2022 arXiv   pre-print
Specifically, we construct landscapes and data distributions such that (1) SGD converges to local maxima, (2) SGD escapes saddle points arbitrarily slowly, (3) SGD prefers sharp minima over flat ones,  ...  and (4) AMSGrad converges to local maxima.  ...  Ziyin is financially supported by the GSS Scholarship from the University of Tokyo. BL acknowledges CNRS for financial support and Werner Krauth for all kinds of help.  ... 
arXiv:2107.11774v3 fatcat:c3r24vjerrauvojeglhijccv7e

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period [article]

Shuheng Shen, Yifei Cheng, Jingchang Liu, Linli Xu
2020 arXiv   pre-print
Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity.  ...  Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is in the order of O(N^{3/2} T^{1/2}) and O(N^{3/4} T^{3/4}) when the data distributions  ...  Stagewise training is also verified to achieve better testing error than general SGD [40]. Large Batch SGD (LB-SGD).  ... 
arXiv:2006.06377v2 fatcat:ftz2loc74rcv3hsph6jcinjrfe

Learning Curves for SGD on Structured Features [article]

Blake Bordelon, Cengiz Pehlevan
2022 arXiv   pre-print
We show that the optimal batch size at a fixed compute budget is typically small and depends on the feature correlation structure, demonstrating the computational benefits of SGD with small batch sizes  ...  To analyze the influence of data structure on test loss dynamics, we study an exactly solvable model of stochastic gradient descent (SGD) on mean square loss which predicts test loss when training on  ...  A series of more recent works have considered the over-parameterized (possibly infinite-dimensional) setting for SGD, deriving power-law test loss curves with exponents better than the O  ... 
arXiv:2106.02713v5 fatcat:oorda7iplvgyjb2w3vkya7vt6u

Tighter Theory for Local SGD on Identical and Heterogeneous Data [article]

Ahmed Khaled and Konstantin Mishchenko and Peter Richtárik
2022 arXiv   pre-print
Our bounds are based on a new notion of variance that is specific to local SGD methods with different data.  ...  The tightness of our results is guaranteed by recovering known statements when we plug in H=1, where H is the number of local steps.  ...  ., 2018) , but unfortunately their guarantee is strictly worse than that of minibatch Stochastic Gradient Descent (SGD).  ... 
arXiv:1909.04746v4 fatcat:s67ponrs5zfgzctnigz347lfe4

Shift-Curvature, SGD, and Generalization [article]

Arwen V. Bradley, Carlos Alberto Gomez-Uribe, Manish Reddy Vuyyuru
2022 arXiv   pre-print
The shift in the shift-curvature is the line connecting train and test local minima, which differ due to dataset sampling or distribution shift.  ...  A longstanding debate surrounds the related hypotheses that low-curvature minima generalize better, and that SGD discourages curvature. We offer a more complete and nuanced view in support of both.  ...  Δθ_t = −(λ/B) ∑_{i=1}^{B} ∂_θ U(x_i, θ),  (9)  where λ is the learning rate, B is the minibatch size, and the x_i are i.i.d. samples from the train set that comprise the minibatch.  ... 
arXiv:2108.09507v3 fatcat:o4d4epv6xbbdblu7vqwwxm2agy

Don't Use Large Mini-Batches, Use Local SGD [article]

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
2020 arXiv   pre-print
We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.  ...  As a remedy, we propose post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency  ...  In direct comparison, post-local SGD is more communication-efficient than mini-batch SGD (while less than local SGD). It achieves better generalization performance than both these algorithms.  ... 
arXiv:1808.07217v6 fatcat:7cmirv2pxrfafh24xjryn5a7bm
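The post-local SGD schedule described above can be sketched in a few lines (a toy illustration on a hypothetical quadratic objective, not the paper's implementation): synchronize every step early in training, then switch to multiple local steps per communication round.

```python
import random

def post_local_sgd(samples, lr, workers, local_steps, switch_round, rounds, theta=0.0):
    # Post-local SGD sketch on the toy objective f(theta) = E[(theta - x)^2 / 2]:
    # for the first `switch_round` rounds every worker takes a single step and
    # the iterates are averaged (equivalent to minibatch SGD); afterwards each
    # worker takes `local_steps` unsynchronized steps between averages.
    for r in range(rounds):
        steps = 1 if r < switch_round else local_steps
        iterates = []
        for _ in range(workers):
            th = theta
            for _ in range(steps):
                x = random.choice(samples)
                th -= lr * (th - x)  # per-sample gradient of the toy objective
            iterates.append(th)
        theta = sum(iterates) / workers  # communication: average worker iterates
    return theta
```

The switch point controls the trade-off the entry describes: frequent synchronization early behaves like large-batch training, while the later local phase adds the implicit noise credited with better generalization.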

SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation [article]

Robert M. Gower, Othmane Sebbouh, Nicolas Loizou
2021 arXiv   pre-print
Stochastic Gradient Descent (SGD) is being used routinely for optimizing non-convex functions.  ...  Our analysis relies on an Expected Residual condition which we show is a strictly weaker assumption than previously used growth conditions, expected smoothness or bounded variance assumptions.  ...  Deep learning without poor local minima. In NeurIPS. Khaled, A. and Richtarik, P. (2020). Better theory for SGD in the nonconvex world. arXiv:2002.03329. Kingma, D. and Ba, J. (2015).  ... 
arXiv:2006.10311v3 fatcat:g7dqyu7775hwtbrnopgcqjg6te

A Light Touch for Heavily Constrained SGD [article]

Andrew Cotter, Maya Gupta, Jan Pfeifer
2016 arXiv   pre-print
Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update.  ...  For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration.  ...  This motivates us to learn to predict the most-violated constraint, ideally at a significantly better than linear-in-m rate.  ... 
arXiv:1512.04960v2 fatcat:vpo6obgmx5h7pliyrfjpuza63i

Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors [article]

Gintare Karolina Dziugaite, Daniel M. Roy
2019 arXiv   pre-print
Entropy-SGD works by optimizing the bound's prior, violating the hypothesis of the PAC-Bayes theorem that the prior is chosen independently of the data.  ...  Indeed, available implementations of Entropy-SGD rapidly obtain zero training error on random labels and the same holds of the Gibbs posterior.  ...  GKD is supported by an EPSRC studentship. DMR is supported by an NSERC Discovery Grant, Connaught Award, Ontario Early Researcher Award, and U.S.  ... 
arXiv:1712.09376v3 fatcat:l3fssx5csbhedcrtl2ojaaznle
Showing results 1 — 15 out of 3,892 results