186 Hits in 9.7 sec

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [article]

Zeke Xie, Issei Sato, Masashi Sugiyama
2021 arXiv   pre-print
Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well.  ...  more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima.  ...  Acknowledgement I would like to express my deep gratitude to Professor Masashi Sugiyama and Professor Issei Sato for their patient guidance and useful critiques of this research work.  ... 
arXiv:2002.03495v14 fatcat:tbavmri37jciziarjz2ybnnt5m
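
As a hedged toy illustration of the abstract's claim (not the paper's derivation): the sketch below measures how long noisy gradient descent stays inside a sharp versus a flat quadratic basin of equal depth when the injected noise variance is taken proportional to the curvature, a common stand-in for minibatch noise near a minimum. All constants (curvatures, noise scale, learning rate) are arbitrary choices for the demo.

```python
# Toy escape-time experiment: curvature-proportional noise makes sharp basins
# much easier to leave than flat basins of the same depth.
import numpy as np

rng = np.random.default_rng(0)


def mean_escape_time(curvature, barrier=0.5, lr=0.01, sigma2=0.4,
                     n_trials=500, max_steps=100_000):
    """Mean number of noisy-GD steps before |x| leaves a basin of depth `barrier`."""
    boundary = np.sqrt(2.0 * barrier / curvature)      # loss reaches `barrier` here
    noise_std = np.sqrt(sigma2 * curvature * lr)       # curvature-proportional noise
    x = np.zeros(n_trials)
    exit_step = np.full(n_trials, max_steps)
    alive = np.ones(n_trials, dtype=bool)
    for t in range(max_steps):
        x[alive] += (-lr * curvature * x[alive]
                     + noise_std * rng.standard_normal(alive.sum()))
        escaped = alive & (np.abs(x) > boundary)
        exit_step[escaped] = t
        alive &= ~escaped
        if not alive.any():
            break
    return exit_step.mean()


for a in (1.0, 4.0):                                   # flat vs. sharp basin, same depth
    print(f"curvature {a}: mean escape step ~ {mean_escape_time(a):.0f}")
```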

Deep Learning Theory Review: An Optimal Control and Dynamical Systems Perspective [article]

Guan-Horng Liu, Evangelos A. Theodorou
2019 arXiv   pre-print
It also provides a principled way for hyper-parameter tuning when optimal control theory is introduced.  ...  The review aims to shed light on the importance of dynamics and optimal control when developing deep learning theory.  ...  As we approach flat local minima, fluctuations from the diffusion become significant.  ...
arXiv:1908.10920v2 fatcat:rimioom5ofenvdazcx2lke5gu4

How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective

Lei Wu, Chao Ma, Weinan E
2018 Neural Information Processing Systems  
The question of which global minima are accessible by a stochastic gradient descent (SGD) algorithm with a specific learning rate and batch size is studied from the perspective of dynamical stability.  ...  In particular, this analysis shows that learning rate and batch size play different roles in minima selection.  ...  Acknowledgement We are grateful to Zhanxing Zhu for very helpful discussions.  ...
dblp:conf/nips/WuME18 fatcat:kwb46m3zgzec3kikvvq35uqktm
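
A hedged toy check of the stability idea (not the paper's exact criterion), assuming one-dimensional per-sample losses l_i(x) = a_i x^2 / 2 that all share the global minimum x* = 0: whether SGD stays at that minimum depends on the learning rate and on the batch size, because the batch size controls how much the sampled curvature fluctuates around its mean.

```python
# Toy check: the same learning rate can be stable for large batches and
# unstable for batch size 1, because small batches see noisy curvatures.
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.5, 3.5] * 500)               # per-sample curvatures, mean curvature = 2.0


def sgd_stays_at_minimum(lr, batch_size, steps=2000, x0=1e-3):
    x = x0
    for _ in range(steps):
        batch = rng.choice(a, size=batch_size, replace=False)
        x -= lr * batch.mean() * x           # SGD step on the sampled quadratic losses
        if not np.isfinite(x) or abs(x) > 1e6:
            return False                     # the iterate escaped the minimum
    return abs(x) < 1.0


for lr in (0.5, 0.9):
    for B in (1, 16, 256):
        print(f"lr={lr}, batch={B:>3}: stable={sgd_stays_at_minimum(lr, B)}")
```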

Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum [article]

Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, Masashi Sugiyama
2022 arXiv   pre-print
Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is arguably the most popular stochastic optimizer for accelerating the training of deep neural networks.  ...  However, it is empirically known that Adam often generalizes worse than Stochastic Gradient Descent (SGD).  ...  Prerequisites for SGD Diffusion We first review the SGD diffusion theory for escaping minima proposed by Xie et al. (2020a).  ...
arXiv:2006.15815v11 fatcat:xflcm54gifho5osza5vpyogafq
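
For reference, a minimal NumPy sketch of the standard Adam update (Kingma & Ba), showing where "momentum" (the first-moment estimate) and the "adaptive learning rate" (the coordinate-wise second-moment estimate) enter; the quadratic test problem and hyperparameters below are arbitrary.

```python
import numpy as np


def adam(grad_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Standard Adam update: momentum via m, adaptive per-coordinate scale via v."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                      # first-moment estimate (momentum)
    v = np.zeros_like(x)                      # second-moment estimate (adaptivity)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x


# Toy usage: a badly scaled quadratic L(x) = (100*x1**2 + x2**2) / 2, where the
# per-coordinate scaling equalizes progress along the two curvature directions.
H = np.array([100.0, 1.0])
print(adam(lambda x: H * x, x0=[1.0, 1.0], lr=0.01, steps=3000))
```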

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks [article]

Umut Simsekli, Levent Sagun, Mert Gurbuzbalaban
2019 arXiv   pre-print
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in  ...  This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion.  ...  Acknowledgments This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project.  ... 
arXiv:1901.06053v1 fatcat:3kcf27a74jbcxp4tis3p2onedu
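
The estimator below is not the one used in the paper; it is a plain Hill estimator applied to synthetic Gaussian versus Pareto samples, just to show how a tail index is read off empirically (heavier tails give a smaller index).

```python
# Hill estimator demo: heavy-tailed samples yield a small tail index,
# Gaussian samples a much larger one.
import numpy as np

rng = np.random.default_rng(2)


def hill_tail_index(samples, tail_fraction=0.05):
    """Hill estimate of the tail index from the top `tail_fraction` of |samples|."""
    x = np.sort(np.abs(samples))[::-1]               # descending order statistics
    k = max(int(tail_fraction * len(x)), 2)
    return k / np.sum(np.log(x[:k] / x[k]))


n = 100_000
gaussian_noise = rng.standard_normal(n)
heavy_noise = rng.pareto(1.5, size=n) + 1.0          # Pareto tail with alpha = 1.5

print("Gaussian   :", hill_tail_index(gaussian_noise))
print("Pareto(1.5):", hill_tail_index(heavy_noise))
```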

Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization [article]

Zeke Xie, Li Yuan, Zhanxing Zhu, Masashi Sugiyama
2021 arXiv   pre-print
It is well-known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential for both optimization and generalization of deep networks.  ...  We theoretically prove the convergence guarantee and the generalization advantage of PNM over Stochastic Gradient Descent (SGD).  ...  Jinze Yu for his helpful discussion. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.  ...
arXiv:2103.17182v4 fatcat:hrr6gtbj25frlhwthmao42g5vm

Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent [article]

Wenqing Hu, Zhanxing Zhu, Haoyi Xiong, Jun Huan
2019 arXiv   pre-print
We interpret the variational inference of Stochastic Gradient Descent (SGD) as minimizing a new potential function named the quasi-potential.  ...  We then consider the dynamics of SGD in the case when the loss function is non-convex and admits several different local minima.  ...  Stochastic gradient descent (see [2], [1], [8]) with a constant learning rate is a stochastic analogue of the gradient descent algorithm, aiming at finding the local or global minimizers of the function  ...
arXiv:1901.06054v1 fatcat:fwc65mjb7nakfn42jf233h3nha

On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks [article]

Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, Levent Sagun
2019 arXiv   pre-print
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in  ...  This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion.  ...  Acknowledgments The contribution of authors Umut Şimşekli, Thanh Huy Nguyen, and Gaël Richard to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX  ...
arXiv:1912.00018v1 fatcat:eyo3rt5dwfffzn5ndktr5su6xy

Optimal Transport for Parameter Identification of Chaotic Dynamics via Invariant Measures [article]

Yunan Yang, Levon Nurbekyan, Elisa Negrini, Robert Martin, Mirjeta Pasha
2022 arXiv   pre-print
We study an optimal transportation approach for recovering parameters in dynamical systems with a single smoothly varying attractor.  ...  In particular, we analyze the regularity of the resulting loss function for general transportation costs and derive gradient formulas.  ...  EN acknowledges that results in this paper were obtained in part using a high-performance computing system acquired through NSF MRI grant DMS-1337943 to WPI.  ... 
arXiv:2104.15138v4 fatcat:bppvwlkwjjf7hiw266e62uy3zq
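
A one-dimensional, hedged toy version of the idea (not the paper's formulation): simulate a long trajectory of the logistic map at a "true" parameter, treat the trajectory samples as a stand-in for the invariant measure, and recover the parameter by minimizing the 1-D Wasserstein distance over a grid. The map, grid, and trajectory length are arbitrary choices for the demo.

```python
# Recover a chaotic-map parameter by matching empirical invariant measures
# under the 1-D optimal-transport (Wasserstein-1) cost.
import numpy as np
from scipy.stats import wasserstein_distance


def logistic_trajectory(r, n=20_000, burn_in=1_000, x0=0.123):
    x = x0
    out = np.empty(n)
    for i in range(burn_in + n):
        x = r * x * (1.0 - x)
        if i >= burn_in:
            out[i - burn_in] = x
    return out


candidates = np.linspace(3.7, 4.0, 61)       # parameter grid for the search
r_true = candidates[42]                      # 3.91, the parameter to recover
reference = logistic_trajectory(r_true)      # samples of the "observed" invariant measure

costs = [wasserstein_distance(reference, logistic_trajectory(r)) for r in candidates]
recovered = candidates[int(np.argmin(costs))]
print(f"true r = {r_true:.3f}, recovered r = {recovered:.3f}")
```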

Power-law escape rate of SGD [article]

Takashi Mori, Liu Ziyin, Kangqiao Liu, Masahito Ueda
2022 arXiv   pre-print
Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss.  ...  This result explains an empirical fact that SGD prefers flat minima with low effective dimensions, giving an insight into implicit biases of SGD.  ...  Such unparalleled success of deep learning hinges crucially on stochastic gradient descent (SGD) and its variants as an efficient training algorithm.  ... 
arXiv:2105.09557v2 fatcat:vpf6zyhnqvf6rdhhi2lvfvvx6q
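
The "multiplicative noise" structure for the mean-square loss can be sanity-checked on a toy linear regression (this is not the paper's derivation): the minibatch gradient variance grows with the loss value itself, so the SGD noise shrinks as an interpolating minimum is approached.

```python
# For the mean-square loss, minibatch gradient noise scales with the loss:
# the gradient-noise std is roughly proportional to the distance from w*.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.standard_normal(n)
w_star = 2.0
y = w_star * x                      # interpolable data: zero loss at w = w_star


def loss_and_grad_noise_std(w, batch_size=32, n_batches=2000):
    residual = w * x - y
    loss = 0.5 * np.mean(residual ** 2)
    grads = np.empty(n_batches)
    for b in range(n_batches):
        idx = rng.integers(0, n, size=batch_size)
        grads[b] = np.mean(residual[idx] * x[idx])   # minibatch gradient at w
    return loss, grads.std()


for w in (w_star + 1.0, w_star + 0.1, w_star + 0.01):
    loss, noise = loss_and_grad_noise_std(w)
    print(f"distance {w - w_star:5.2f}: loss={loss:.4f}  grad-noise std={noise:.4f}")
```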

Understanding Short-Range Memory Effects in Deep Neural Networks [article]

Chengli Tan, Jiangshe Zhang, Junmin Liu
2021 arXiv   pre-print
Stochastic gradient descent (SGD) is of fundamental importance in deep learning. Despite its simplicity, elucidating its efficacy remains challenging.  ...  The result suggests a lower escaping rate for a larger Hurst parameter, and thus SGD stays longer in flat minima.  ...  More recently, based on the assumption that SGN follows a Gaussian distribution, Xie et al.  ...  found that SGD favors flat minima exponentially more than sharp minima.  ...
arXiv:2105.02062v4 fatcat:ae3jphed7rhgbjqxcooe6e6gpy
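
Not the paper's estimator; an aggregated-variance sketch of how a Hurst parameter is read off a noise series: for (fractional) Gaussian noise the variance of block means scales like m^(2H-2), so H comes from the slope of a log-log fit. Applied to plain white noise, used here as a stand-in for a gradient-noise trace, the estimate should come out near H = 0.5.

```python
# Aggregated-variance Hurst estimate: slope of log Var(block mean) vs. log m
# gives 2H - 2; white noise should return roughly H = 0.5.
import numpy as np

rng = np.random.default_rng(4)


def hurst_aggregated_variance(series, block_sizes=(4, 8, 16, 32, 64, 128)):
    variances = []
    for m in block_sizes:
        n_blocks = len(series) // m
        block_means = series[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        variances.append(block_means.var())
    slope, _ = np.polyfit(np.log(block_sizes), np.log(variances), 1)
    return 1.0 + slope / 2.0                 # slope = 2H - 2


white_noise = rng.standard_normal(200_000)   # stand-in for a gradient-noise trace
print("estimated Hurst parameter:", hurst_aggregated_variance(white_noise))
```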

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs) [article]

Zhiyuan Li, Sadhika Malladi, Sanjeev Arora
2021 arXiv   pre-print
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets.  ...  Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019) only applies  ...  Xie et al. (2021) constructed an SDE-motivated diffusion model to explain why SGD favors flat minima during optimization.  ...
arXiv:2102.12470v2 fatcat:n533sixfgra4nhpgp7x3sgvy34
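
To make the SDE picture concrete (an illustration, not the paper's formal analysis): for a 1-D quadratic loss with constant gradient-noise variance, the SDE commonly used to model SGD is dX = -aX dt + sqrt(lr)·sigma dW. The sketch compares finite-learning-rate SGD against a fine Euler-Maruyama simulation of that SDE; the stationary spreads come out close but not identical, and all constants are arbitrary.

```python
# Finite-LR SGD vs. a fine discretization of its modelling SDE on a quadratic.
import numpy as np

rng = np.random.default_rng(5)
a, sigma, lr = 1.0, 1.0, 0.1        # curvature, gradient-noise std, learning rate
total_time = 5_000.0                # "SDE time" covered by both simulations

# SGD at learning rate lr (one step advances SDE time by lr).
n_sgd = int(total_time / lr)
noise = rng.standard_normal(n_sgd)
x, xs = 0.0, np.empty(n_sgd)
for t in range(n_sgd):
    x -= lr * (a * x + sigma * noise[t])
    xs[t] = x
sgd_std = xs[n_sgd // 2:].std()

# Fine Euler-Maruyama discretization of  dX = -a X dt + sqrt(lr) * sigma dW.
h = lr / 20.0
n_sde = int(total_time / h)
noise = rng.standard_normal(n_sde)
x, xs = 0.0, np.empty(n_sde)
for t in range(n_sde):
    x += -a * x * h + np.sqrt(lr) * sigma * np.sqrt(h) * noise[t]
    xs[t] = x
sde_std = xs[n_sde // 2:].std()

print(f"stationary std of the iterates: SGD = {sgd_std:.3f}, SDE model = {sde_std:.3f}")
```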

The Heavy-Tail Phenomenon in SGD [article]

Mert Gurbuzbalaban, Umut Şimşekli, Lingjiong Zhu
2021 arXiv   pre-print
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning.  ...  We then translate our results into insights about the behavior of SGD in deep learning.  ...  A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations, 2021.  ... 
arXiv:2006.04740v5 fatcat:mse445ken5h7zg2orncurvegsi

Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration [article]

Xuefeng Gao, Mert Gürbüzbalaban, Lingjiong Zhu
2020 arXiv   pre-print
such as stochastic gradient Langevin dynamics (SGLD) in many applications.  ...  Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is a variant of stochastic gradient descent with momentum where controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer  ...  Varadhan for helpful discussions. Xuefeng Gao acknowledges support from Hong Kong RGC Grants 24207015 and 14201117.  ...
arXiv:1809.04618v4 fatcat:yy5emsbeyfekxloysidf6elpzm
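
A minimal sketch of the mechanism the snippet describes, following the common "naive" SGHMC discretization (momentum update with a friction term plus injected Gaussian noise scaled with the step size) rather than any formulation specific to this paper, and sampling a toy 1-D standard-normal target.

```python
# Naive SGHMC on a 1-D target exp(-x^2/2): momentum + friction + scaled noise.
import numpy as np

rng = np.random.default_rng(6)


def grad_U(x):                      # potential U(x) = x^2 / 2  =>  target N(0, 1)
    return x


eta, alpha = 0.01, 0.1              # step size and friction coefficient
n_steps, burn_in = 200_000, 20_000
noise = np.sqrt(2.0 * alpha * eta) * rng.standard_normal(n_steps)

x, v = 0.0, 0.0
samples = np.empty(n_steps)
for t in range(n_steps):
    v = (1.0 - alpha) * v - eta * grad_U(x) + noise[t]   # momentum with friction + noise
    x = x + v
    samples[t] = x

print("sample mean ~", samples[burn_in:].mean(), " sample std ~", samples[burn_in:].std())
```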

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [article]

Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar
2021 arXiv   pre-print
We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability.  ...  ACKNOWLEDGEMENTS This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Amazon Web Services Award, a JP Morgan A.I. Research Faculty Award, a Carnegie Bosch Institute Research Award, a Facebook Faculty Research Award, and a Block Center Grant.  ...
arXiv:2103.00065v2 fatcat:r32apl7rbrhp7bbupzq6wpowqm
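
For context, a minimal check (not the paper's experiments) of the classical quadratic stability threshold that the "Edge of Stability" regime is named after: gradient descent on a quadratic converges only while the step size times the largest Hessian eigenvalue (the sharpness) stays below 2.

```python
# Classical stability threshold: GD on L(x) = sharpness * x^2 / 2 converges
# when lr * sharpness < 2 and diverges when lr * sharpness > 2.
import numpy as np

sharpness = 50.0                      # largest Hessian eigenvalue
x0 = 1.0

for lr in (1.8 / sharpness, 2.2 / sharpness):
    x = x0
    for _ in range(100):
        x -= lr * sharpness * x       # gradient step on the quadratic
    status = "converges" if abs(x) < abs(x0) else "diverges"
    print(f"lr * sharpness = {lr * sharpness:.1f}: |x_100| = {abs(x):.3e}  ({status})")
```
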
Showing results 1 — 15 out of 186 results