SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [article]

Satyen Kale, Ayush Sekhari, Karthik Sridharan
2021 arXiv   pre-print
In this paper, we consider the problem of SCO and explore the role of implicit regularization, batch size and multiple epochs for SGD.  ...  Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models.  ...  Acknowledgements We thank Dylan Foster, Roi Livni, Robert Kleinberg and Mehryar Mohri for helpful discussions. AS was an intern at Google Research, NY when a part of the work was performed.  ... 
arXiv:2107.05074v1 fatcat:uan3uyb2krbqdkiufemgfneuo4

Label Noise SGD Provably Prefers Flat Global Minimizers [article]

Alex Damian, Tengyu Ma, Jason D. Lee
2021 arXiv   pre-print
, strength of the label noise, and the batch size, and R(θ) is an explicit regularizer that penalizes sharp minimizers.  ...  Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise.  ...  More specifically, we compute the average of $\|\nabla \hat{L}^{(k)}(\theta_k) - \nabla L^{(k)}(\theta_k)\|^2$ over an epoch and then renormalize by the batch size.  ... 
arXiv:2106.06530v2 fatcat:4azph7lkhjazzdhs7uzb357joa

Path-SGD: Path-Normalized Optimization in Deep Neural Networks [article]

Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro
2015 arXiv   pre-print
We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise  ...  We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights.  ...  Acknowledgments Research was partially funded by NSF award IIS-1302662 and Intel ICRI-CI. We thank Hao Tang for insightful discussions.  ... 
arXiv:1506.02617v1 fatcat:isriewanhrcyvgg5x2ycglji54

Painless step size adaptation for SGD [article]

Ilona Kulikovskikh, Tarzan Legović
2021 arXiv   pre-print
To avoid the conflict, recent studies suggest adopting a moderately large step size for optimizers, but the added value on the performance remains unclear.  ...  This contribution allows to: 1) improve both convergence and generalization of neural networks with no need to guarantee their stability; 2) build more reliable and explainable network architectures with  ...  The occurrence of an implicit regularizer demystifies this matter as well. For a small step size, SGD behaves similar to GD on the full batch loss function.  ... 
arXiv:2102.00853v1 fatcat:3h4gki5apvfhjfa5l5wnk7b6du

Global Sparse Momentum SGD for Pruning Very Deep Neural Networks [article]

Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, Ji Liu
2019 arXiv   pre-print
In this paper, we propose a novel momentum-SGD-based optimization method to reduce the network complexity by on-the-fly pruning.  ...  ; and 4) superior capability to find better winning tickets which have won the initialization lottery.  ...  Acknowledgement We sincerely thank all the reviewers for their comments. This work was supported by the National Key  ... 
arXiv:1909.12778v3 fatcat:hh4wnkm2xvdj3nianidyliol6e

Bad Global Minima Exist and SGD Can Reach Them [article]

Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas
2021 arXiv   pre-print
The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization.  ...  In contrast, we find that in the presence of explicit regularization, pretraining with random labels has no detrimental effect on SGD.  ...  Acknowledgements Dimitris Papailiopoulos is supported by an NSF CAREER Award #1844951, two Sony Faculty Innovation Awards, an AFOSR & AFRL Center of Excellence Award FA9550-18-1-0166, and an NSF TRIPODS  ... 
arXiv:1906.02613v2 fatcat:bh3pgrt3jvddxhknmreimoipl4

On the Noisy Gradient Descent that Generalizes as SGD [article]

Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, Zhanxing Zhu
2020 arXiv   pre-print
Our finding is based on a novel observation on the structure of the SGD noise: it is the multiplication of the gradient matrix and a sampling noise that arises from the mini-batch sampling procedure.  ...  The gradient noise of SGD is considered to play a central role in the observed strong generalization abilities of deep learning.  ...  For instance considering mini-batch SGD without replacement, the sampling vector $W_{sgd}$ contains exactly $b$ multiples of $1/b$ and $n - b$ multiples of zero with random index.  ... 
arXiv:1906.07405v3 fatcat:odo7cpoht5cdzk5o5auuog7g7q

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs) [article]

Zhiyuan Li, Sadhika Malladi, Sanjeev Arora
2021 arXiv   pre-print
(c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.  ...  Experimental verification of the approximation appears computationally infeasible.  ...  Acknowledgement The authors acknowledge support from NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC.  ... 
arXiv:2102.12470v2 fatcat:n533sixfgra4nhpgp7x3sgvy34

Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors [article]

Gintare Karolina Dziugaite, Daniel M. Roy
2019 arXiv   pre-print
Indeed, available implementations of Entropy-SGD rapidly obtain zero training error on random labels and the same holds of the Gibbs posterior.  ...  Entropy-SGD works by optimizing the bound's prior, violating the hypothesis of the PAC-Bayes theorem that the prior is chosen independently of the data.  ...  Acknowledgments This research was carried out in part while the authors were visiting the Simons Institute for the Theory of Computing at UC Berkeley.  ... 
arXiv:1712.09376v3 fatcat:l3fssx5csbhedcrtl2ojaaznle

On the Generalization of Models Trained with SGD: Information-Theoretic Bounds and Implications [article]

Ziqiao Wang, Yongyi Mao
2021 arXiv   pre-print
They also point to a new and simple regularization scheme which we show performs comparably to the current state of the art.  ...  Experimental study based on these bounds provide some insights on the SGD training of neural networks.  ...  The learning rate and batch size in SGD have explicitly appeared in the trajectory term of Eq.1 in Theorem 2.  ... 
arXiv:2110.03128v1 fatcat:h44nwenhx5fupoqg5obsink7py

Strength of Minibatch Noise in SGD [article]

Liu Ziyin, Kangqiao Liu, Takashi Mori, Masahito Ueda
2021 arXiv   pre-print
In this work, we study the nature of SGD noise and fluctuation.  ...  We show that some degree of mismatch between model and data complexity is needed for SGD to "stir" a noise; such mismatch may be due to a label or input noise, regularization, or underparametrization.  ...  Therefore, in discrete-time, choosing a γ-prior forbids part of the solutions to be found by the SGD dynamics, and this part of the forbidding region can be seen as an "implicit prior" of SGD.  ... 
arXiv:2102.05375v2 fatcat:ves2rq6iwngdrpxkqv33immhs4

Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion [article]

Daniel Kunin, Javier Sagastuy-Brena, Lauren Gillespie, Eshed Margalit, Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins
2021 arXiv   pre-print
To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation.  ...  Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the  ...  We hope our newly derived understanding of the limiting dynamics of SGD, and its dependence on various important hyperparameters like batch size, learning rate, and momentum, can serve as a basis for future  ... 
arXiv:2107.09133v3 fatcat:5aoly6xqprdrtmo7etnf4muhuy

Weighted SGD for ℓ_p Regression with Randomized Preconditioning [article]

Jiyan Yang, Yin-Lam Chow, Christopher Ré, Michael W. Mahoney
2017 arXiv   pre-print
This complexity is uniformly better than that of RLA methods in terms of both $\epsilon$ and $d$ when the problem is unconstrained.  ...  Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets.  ...  We would like to acknowledge the Army Research Office, the Defense Advanced Research Projects Agency, and the Department of Energy for providing partial support for this work.  ... 
arXiv:1502.03571v5 fatcat:iwnnvkbra5gm7mlqj2vrstji2i

Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise [article]

Spencer Frei, Yuan Cao, Quanquan Gu
2021 arXiv   pre-print
We prove that SGD produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution for a broad class of distributions that includes log-concave  ...  To the best of our knowledge, this is the first work to show that overparameterized neural networks trained by SGD can generalize when the data is corrupted with adversarial label noise.  ...  Acknowledgements We thank James-Michael Leahy for a number of helpful discussions. We thank Maria-Florina Balcan for pointing us to a number of works on learning halfspaces in the presence of noise.  ... 
arXiv:2101.01152v3 fatcat:rhygrb6cmrcslbumz3panrv6ym

Weighted SGD for ℓp Regression with Randomized Preconditioning

Jiyan Yang, Yin-Lam Chow, Christopher Ré, Michael W. Mahoney
2015 Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms  
Such SGD convergence rates are superior to other related SGD algorithms such as the weighted randomized Kaczmarz algorithm. Particularly, when solving ℓ1 regression with size n by d, pwSGD returns an approximate  ...  SGD methods are easy to implement and applicable to a wide range of convex optimization problems.  ...  We would like to acknowledge the Army Research Office, the Defense Advanced Research Projects Agency, and the Department of Energy for providing partial support for this work.  ... 
doi:10.1137/1.9781611974331.ch41 pmid:29782626 pmcid:PMC5959301 fatcat:nu3fbr4wwfflxmdezqrmzwgxxe