
Rethinking Adam: A Twofold Exponential Moving Average Approach [article]

Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, Yun Fu
2022 arXiv   pre-print
We further develop a theory to back up the improvement in generalization and provide convergence guarantees under both convex and nonconvex settings.  ...  Adaptive gradient methods, e.g. Adam, have achieved tremendous success in machine learning.  ...  ., 2020) adapts stepsizes by the belief in the observed gradients.  ... 
arXiv:2106.11514v3

Optimal Adaptive and Accelerated Stochastic Gradient Descent [article]

Qi Deng and Yi Cheng and Guanghui Lan
2018 arXiv   pre-print
Moreover, acceleration (a.k.a. momentum) methods and diagonal scaling (a.k.a. adaptive gradient) methods are the two main techniques to improve the slow convergence of Sgd.  ...  In this paper, we present a new class of adaptive and accelerated stochastic gradient descent methods and show that they exhibit the optimal sampling and iteration complexity for stochastic optimization  ...  For many learning tasks, Sgd converges slowly and momentum method improves Sgd by adding inertia of the iterates to accelerate the optimization convergence.  ... 
arXiv:1810.00553v1
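The inertia idea this abstract describes can be sketched in a few lines of Python (classic heavy-ball momentum; the function name and hyperparameter values here are illustrative, not taken from the paper):

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    """One SGD-with-momentum step: the velocity accumulates past gradients,
    giving the iterates inertia that speeds progress along directions of
    consistent descent."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity
```

Setting mu=0 recovers plain SGD; larger mu keeps more of the previous update direction.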

L4: Practical loss-based stepsize adaptation for deep learning [article]

Michal Rolinek, Georg Martius
2018 arXiv   pre-print
We demonstrate its capabilities by conclusively improving the performance of Adam and Momentum optimizers.  ...  We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss.  ...  Acknowledgement We would like to thank Alex Kolesnikov, Friedrich Solowjow, and Anna Levina for helping to improve the manuscript.  ... 
arXiv:1802.05074v5

On the Last Iterate Convergence of Momentum Methods [article]

Xiaoyu Li and Mingrui Liu and Francesco Orabona
2022 arXiv   pre-print
Based on this fact, we study a class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with increasing momentum and shrinking updates.  ...  Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD.  ...  Acknowledgements This material is based upon work supported by the National Science Foundation under the grants no. 1925930 "Collaborative Research: TRIPODS Institute for Optimization and Learning", no  ... 
arXiv:2102.07002v3

Formal guarantees for heuristic optimization algorithms used in machine learning [article]

Xiaoyu Li
2022 arXiv   pre-print
In this work, we start to close this gap by providing formal guarantees to a few heuristic optimization methods and proposing improved algorithms.  ...  Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization of machine learning (ML) problems.  ...  This is indeed the case when the optimal solution has low training error and the stochastic gradients are generated by mini-batches.  ... 
arXiv:2208.00502v1

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes [article]

Rachel Ward, Xiaoxia Wu, Leon Bottou
2021 arXiv   pre-print
Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread  ...  the step-size to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient.  ...  We thank Arthur Szlam and Mark Tygert for constructive suggestions. We also thank Francis Bach, Alexandre Defossez, Ben Recht, Stephen Wright, and Adam Oberman.  ... 
arXiv:1806.01811v8
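The on-the-fly stepsize adaptation the abstract refers to is AdaGrad's accumulation of squared gradients; a minimal single-parameter sketch (variable names are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """AdaGrad: divide the stepsize by the root of the running sum of
    squared gradients, so coordinates with historically large gradients
    take smaller steps."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```

Because the accumulator only grows, effective stepsizes shrink over time, which is what removes the need to know the Lipschitz constant in advance.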

Page 2916 of Mathematical Reviews Vol. , Issue 99d [page]

1999 Mathematical Reviews  
Tseng, Paul (1-WA; Seattle, WA) 99d:90120 90C30 49M07. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. (English summary) SIAM J. Optim. 8 (1998), no. 2, 506-531 (electronic). An incremental gradient method with momentum term and adaptive stepsize rule is considered for minimizing the sum of continuously differentiable functions.  ... 

Dyna: A Method of Momentum for Stochastic Optimization [article]

Zhidong Han
2018 arXiv   pre-print
An algorithm is presented for momentum gradient descent optimization based on the first-order differential equation of the Newtonian dynamics.  ...  The fictitious mass is introduced to the dynamics of momentum for regularizing the adaptive stepsize of each individual parameter.  ...  In order to improve the convergence of the optimization methods, an adaptive stepsize is widely used to stabilize and speed up the learning process.  ... 
arXiv:1805.04933v1

Adam: A Method for Stochastic Optimization [article]

Diederik P. Kingma, Jimmy Ba
2017 arXiv   pre-print
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.  ...  The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients.  ...  There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly  ... 
arXiv:1412.6980v9
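The adaptive estimates of lower-order moments the abstract mentions are two exponential moving averages, one of the gradient and one of its square, each bias-corrected; a minimal single-parameter sketch using the paper's published default hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t counts steps starting from 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias corrections for the
    v_hat = v / (1 - beta2 ** t)              # zero-initialized EMAs
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The bias correction matters early on, when the zero-initialized averages would otherwise underestimate both moments.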

FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity [article]

Yangfan Zhou, Kaizhu Huang, Cheng Cheng, Xuguang Wang, Amir Hussain, Xin Liu
2021 arXiv   pre-print
AdaBelief, one of the current best optimizers, demonstrates superior generalization ability compared to the popular Adam algorithm by viewing the exponential moving average of observed gradients.  ...  In particular, by adjusting the step size that better considers strong convexity and prevents fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence  ...  proposed AdaBelief [16] to adapt the stepsize by the belief in observed gradients, which re-designs the second-order momentum as s_t = β_2 s_{t-1} + (1 − β_2)(g_t − m_t)^2, where m_t is the first-order  ... 
arXiv:2104.13790v2
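The re-designed second moment quoted in this abstract differs from Adam only in tracking the squared deviation of the gradient from its own EMA, rather than the squared gradient itself; a minimal sketch of that modification (the published AdaBelief algorithm also adds a small ε inside the s_t recursion, omitted here for brevity):

```python
import numpy as np

def adabelief_step(theta, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdaBelief 'belief' term: EMA of (g_t - m_t)^2, the deviation of the
    observed gradient from its EMA prediction."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2
    m_hat = m / (1 - beta1 ** t)          # bias corrections as in Adam
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```

When gradients are consistent, (g_t − m_t)² is small, so s_t shrinks and the effective stepsize grows; noisy gradients have the opposite effect.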

Convergence Analysis of Proximal Gradient with Momentum for Nonconvex Optimization [article]

Qunwei Li, Yi Zhou, Yingbin Liang, Pramod K. Varshney
2017 arXiv   pre-print
We also extend the analysis to the inexact version of these methods and develop an adaptive momentum strategy that improves the numerical performance.  ...  Then, by exploiting the Kurdyka-Łojasiewicz (KŁ) property for a broad class of functions, we establish the linear and sub-linear convergence rates of the function value sequence generated by APGnc.  ...  We also proposed an improved algorithm APGnc+ by adapting the momentum parameter.  ... 
arXiv:1705.04925v1

AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients [article]

Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar Tatikonda, Nicha Dvornek, Xenophon Papademetris, James S. Duncan
2020 arXiv   pre-print
Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum).  ...  Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer.  ...  Acknowledgments and Disclosure of Funding This research is supported by NIH grant R01NS035193.  ... 
arXiv:2010.07468v5

On Exploiting Layerwise Gradient Statistics for Effective Training of Deep Neural Networks [article]

Guoqiang Zhang and Kenta Niwa and W. Bastiaan Kleijn
2022 arXiv   pre-print
Adam and AdaBelief compute and make use of elementwise adaptive stepsizes in training deep neural networks (DNNs) by tracking the exponential moving average (EMA) of the squared-gradient g_t^2 and the  ...  Firstly, we slightly modify Adam and AdaBelief by introducing layerwise adaptive stepsizes in their update procedures via either pre- or post-processing.  ...  Firstly, we make a slight modification to Adam and Ad-aBelief by introducing and using layerwise adaptive stepsizes instead of elementwise ones.  ... 
arXiv:2203.13273v4

Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties [article]

Brett Daley, Christopher Amato
2021 arXiv   pre-print
Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes.  ...  We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between  ...  Traditionally, momentum is first applied to the gradient estimator, and then is scaled by the adaptive method (e.g. Adam).  ... 
arXiv:2010.01356v2
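The per-component unweighted mean the abstract mentions replaces the EMA denominator with an arithmetic mean over all historical squared gradients; a rough single-parameter sketch of just that idea (the full Expectigrad algorithm also includes the joint bias-corrected momentum term described in the abstract, which is omitted here, and the function name is illustrative):

```python
import numpy as np

def mean_normalized_step(theta, grad, sum_sq, n, lr=0.1, eps=1e-8):
    """Normalize the step by the unweighted arithmetic mean of all
    historical squared gradients rather than an exponential moving
    average, so no past gradient is ever forgotten."""
    sum_sq = sum_sq + grad ** 2
    n = n + 1
    theta = theta - lr * grad / (np.sqrt(sum_sq / n) + eps)
    return theta, sum_sq, n
```

Unlike an EMA, this denominator cannot collapse when recent gradients happen to be small, which is the robustness property the title alludes to.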

Learn-and-Adapt Stochastic Dual Gradients for Network Resource Allocation [article]

Tianyi Chen, Qing Ling, Georgios B. Giannakis
2017 arXiv   pre-print
Recognizing the central role of Lagrange multipliers in network resource allocation, a novel learn-and-adapt stochastic dual gradient (LA-SDG) method is developed in this paper to learn the sample-optimal Lagrange multiplier from historical data, and accordingly adapt the upcoming resource allocation strategy.  ...  Xin Wang, Longbo Huang and Jia Liu for helpful discussions.  ... 
arXiv:1703.01673v2