9,843 Hits in 7.9 sec

Conjugate Directions for Stochastic Gradient Descent [chapter]

Nicol N. Schraudolph, Thore Graepel
2002 Lecture Notes in Computer Science  
In our benchmark experiments the resulting online learning algorithms converge orders of magnitude faster than ordinary stochastic gradient descent.  ...  The method of conjugate gradients provides a very effective way to optimize large, deterministic systems by gradient descent.  ...  The state of the art for such stochastic problems is therefore simple gradient descent, coupled with adaptation of local step size and/or momentum parameters. Curvature matrix-vector products.  ... 
doi:10.1007/3-540-46084-5_218 fatcat:mye3y4evavggdekufwafdd53ba
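The snippet's closing phrase refers to curvature matrix-vector products: the Hessian never needs to be formed explicitly, because H v can be obtained from gradients alone. A minimal sketch of the idea on a toy quadratic (the finite-difference approximation and all names here are illustrative, not the paper's implementation):

```python
import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-6):
    """Approximate the curvature matrix-vector product H v via a finite
    difference of gradients: H v ≈ (grad(w + eps*v) - grad(w)) / eps."""
    return (grad(w + eps * v) - grad(w)) / eps

# Toy quadratic f(w) = 0.5 wᵀAw - bᵀw, so grad(w) = A w - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b

w = np.zeros(2)
v = np.array([1.0, -1.0])
hv = hessian_vector_product(grad, w, v)
# for a quadratic the finite difference is exact up to rounding: hv ≈ A v
```

For non-quadratic objectives the same trick gives an O(eps) approximation; exact products can be obtained with Pearlmutter's R-operator or automatic differentiation.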

Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [article]

Ricky T. Q. Chen, Dami Choi, Lukas Balles, David Duvenaud, Philipp Hennig
2020 arXiv   pre-print
The smoothness of our updates makes them more amenable to simple step size selection schemes, which we also base on our estimated quantities.  ...  We prove that our model-based procedure converges in the noisy quadratic setting.  ...  Gradient descent (GD) converges nicely with a high learning rate, and using adaptive step sizes leads to a better convergence rate.  ... 
arXiv:2011.04803v1 fatcat:pek6vnmdfrg77hphsfxgg56f4e

A Stochastic Quasi-Newton Method for Online Convex Optimization

Nicol N. Schraudolph, Jin Yu, Simon Günter
2007 Journal of machine learning research  
The resulting algorithm performs comparably to a well-tuned natural gradient descent but is scalable to very high-dimensional problems.  ...  We are working on analyzing the convergence of online (L)BFGS, and extending it to nonconvex optimization problems.  ...  Schraudolph (1999, 2002) further accelerates stochastic gradient descent through online adaptation of a gain vector.  ... 
dblp:journals/jmlr/SchraudolphYG07 fatcat:4cn5f3h4czbmvhqn42dtlg5o2y
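The (L)BFGS machinery the abstract builds on forms an inverse-curvature estimate from observed parameter and gradient differences. A self-contained sketch of the standard two-loop recursion (a textbook component, not the paper's online variant):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """L-BFGS two-loop recursion: apply an inverse-Hessian estimate built
    from curvature pairs (s_i = w_{i+1} - w_i, y_i = g_{i+1} - g_i) to g."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    if s_list:                                  # initial scaling H0 = (sᵀy / yᵀy) I
        s, y = s_list[-1], y_list[-1]
        q = q * (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest to newest
        b = (y @ q) / (y @ s)
        q = q + (a - b) * s
    return q                                    # approximates H^{-1} g

# Sanity check on a quadratic, where y = A s exactly: the estimate must
# satisfy the secant equation, i.e. applying it to y returns s.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
s = np.array([1.0, 0.5])
y = A @ s
d = lbfgs_direction(y, [s], [y])
```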

Quasi-Newton methods: superlinear convergence without line searches for self-concordant functions

Wenbo Gao, Donald Goldfarb
2018 Optimization Methods and Software  
of stochastic gradient descent on stochastic optimization problems.  ...  We consider the use of a curvature-adaptive step size in gradient-based iterative methods, including quasi-Newton methods, for minimizing self-concordant functions, extending an approach first proposed  ...  Acknowledgements We would like to thank Jorge Nocedal for carefully reading and providing very helpful suggestions for improving an earlier version of this paper.  ... 
doi:10.1080/10556788.2018.1510927 fatcat:xtmmqflsq5grplwt3guuysd6ty

Quasi-Newton Methods: Superlinear Convergence Without Line Searches for Self-Concordant Functions [article]

Wenbo Gao, Donald Goldfarb
2018 arXiv   pre-print
of stochastic gradient descent on stochastic optimization problems.  ...  We consider the use of a curvature-adaptive step size in gradient-based iterative methods, including quasi-Newton methods, for minimizing self-concordant functions, extending an approach first proposed  ...  Acknowledgements We would like to thank Jorge Nocedal for carefully reading and providing very helpful suggestions for improving an earlier version of this paper.  ... 
arXiv:1612.06965v3 fatcat:5dt4s3uemvatpl7enjv43rdcsm

Near optimal step size and momentum in gradient descent for quadratic functions

Engin Taş, Memmedağa Memmedli
2017 Turkish Journal of Mathematics  
We propose to determine the near-optimal step size and momentum factor simultaneously for gradient descent in a stochastic quadratic bowl, from the largest and smallest eigenvalues of the Hessian.  ...  Step size and momentum factor should be carefully tuned in order to take advantage of the safe, global convergence properties of the gradient descent method.  ... 
doi:10.3906/mat-1411-51 fatcat:ndxlpxuajjbj5hi6f7jhx2xu6m
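For a quadratic whose Hessian eigenvalues lie in [mu, L], classical heavy-ball analysis gives closed-form near-optimal parameters, which is the flavor of result the abstract describes. A sketch under that assumption (the formulas are Polyak's classical choice; the paper's derivation may differ in detail):

```python
import numpy as np

def heavy_ball_params(mu, L):
    """Polyak's classical step size and momentum for heavy-ball gradient
    descent on a quadratic with Hessian eigenvalues in [mu, L]."""
    r = np.sqrt(L) + np.sqrt(mu)
    alpha = 4.0 / r**2                            # step size
    beta = ((np.sqrt(L) - np.sqrt(mu)) / r) ** 2  # momentum factor
    return alpha, beta

# Quadratic bowl f(w) = 0.5 wᵀAw with eigenvalues 1 and 100.
A = np.diag([1.0, 100.0])
alpha, beta = heavy_ball_params(1.0, 100.0)

w = w_prev = np.array([1.0, 1.0])
for _ in range(200):
    w, w_prev = w - alpha * (A @ w) + beta * (w - w_prev), w
# the iterates contract toward the minimizer at the origin
```

With these parameters the asymptotic contraction factor is sqrt(beta) = (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)) per step, versus roughly 1 - mu/L for plain gradient descent.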

Adaptive Learning Rate and Momentum for Training Deep Neural Networks [article]

Zhiyong Hao, Yixuan Jiang, Huihua Yu, Hsiao-Dong Chiang
2021 arXiv   pre-print
On the one hand, a quadratic line-search determines the step size according to the current loss landscape.  ...  In this paper, we develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework. We propose the Conjugate Gradient with Quadratic line-search (CGQ) method.  ...  In fact, CG can be understood as a Gradient Descent with an adaptive step size and dynamically updated momentum.  ... 
arXiv:2106.11548v2 fatcat:ie4raizeuffaxinge3x3ad72w4
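The last sentence of the snippet — CG as gradient descent with an adaptive step size and dynamically updated momentum — is easiest to see on a plain quadratic, where the line search has a closed form. An illustrative linear-CG sketch (not the paper's CGQ method, which targets deep networks):

```python
import numpy as np

# Linear conjugate gradients on f(w) = 0.5 wᵀAw - bᵀw: the step size comes
# from an exact quadratic line search, the "momentum" beta from Fletcher-Reeves.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

w = np.zeros(2)
g = A @ w - b                 # gradient
d = -g                        # initial search direction
for _ in range(2):            # CG terminates in n = 2 steps on a 2-D quadratic
    step = (g @ g) / (d @ A @ d)       # exact line search along d
    w = w + step * d
    g_new = A @ w - b
    beta = (g_new @ g_new) / (g @ g)   # dynamically updated momentum
    d = -g_new + beta * d
    g = g_new
# w now satisfies A w = b
```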

Speeding-Up Convergence via Sequential Subspace Optimization: Current State and Future Directions [article]

Michael Zibulevsky
2013 arXiv   pre-print
We explored its combination with Parallel Coordinate Descent and Separable Surrogate Function methods, obtaining state-of-the-art results in the above-mentioned areas.  ...  multiplication by the Hessian is available; Stochastic optimization methods - for problems with large stochastic-type data; Multigrid methods - for problems with nested multilevel structure.  ...  direction is good, the local quadratic convergence rate of Newton is preserved.  ... 
arXiv:1401.0159v1 fatcat:j4pysvjrqjfz7ozapzxpn7tj4a

ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient [article]

Caglar Gulcehre, Marcin Moczulski, Yoshua Bengio
2015 arXiv   pre-print
The convergence of SGD depends on the careful choice of the learning rate and the amount of noise in the stochastic estimates of the gradients.  ...  The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first-order gradients.  ...  Acknowledgments We thank the developers of Theano [2] and Pylearn2 [5] and the computational resources provided by Compute Canada and Calcul Québec.  ... 
arXiv:1412.7419v5 fatcat:up5ga7ggxrdtriofoki5r7xvim

The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems [article]

Vivak Patel
2018 arXiv   pre-print
We derive this mechanism based on a detailed analysis of a generic stochastic quadratic problem, which generalizes known results for classical gradient descent.  ...  In several experimental reports on nonconvex optimization problems in machine learning, stochastic gradient descent (SGD) was observed to prefer minimizers with flat basins in comparison to more deterministic  ...  We would also like to thank Mihai Anitescu for his general guidance throughout the preparation of this work. Funding The author is supported by the NSF Research and Training Grant # 1547396.  ... 
arXiv:1709.04718v2 fatcat:qksvlnw2xzcuflt74rztqprjwi

Semistochastic Quadratic Bound Methods [article]

Aleksandr Y. Aravkin, Anna Choromanska, Tony Jebara, Dimitri Kanevsky
2014 arXiv   pre-print
The efficacy of SQB methods is demonstrated via comparison with several state-of-the-art techniques on commonly used datasets.  ...  Semistochastic methods fall in between batch algorithms, which use all the data, and stochastic gradient type methods, which use small random selections at each iteration.  ...  Bottou implementation but with pre-specified step size
• SAG: stochastic average gradient method using the estimate of the Lipschitz constant L
• ASGD: averaged stochastic gradient descent method with  ... 
arXiv:1309.1369v4 fatcat:ps3bytcwmvcvveodqfbb6xp2xi

The Geometry of Sign Gradient Descent [article]

Lukas Balles and Fabian Pedregosa and Nicolas Le Roux
2020 arXiv   pre-print
Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the ℓ_∞-norm.  ...  Furthermore, they are closely connected to so-called adaptive gradient methods like Adam.  ...  Lukas Balles kindly acknowledges the support of the International Max Planck Research School for Intelligent Systems (IMPRS-IS) as well as financial support by the European Research Council through ERC  ... 
arXiv:2002.08056v1 fatcat:uakvuoahbzh5disayo3x7lwxcm
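Sign gradient descent itself is a one-line update: every coordinate moves a fixed amount against the sign of its gradient, which is steepest descent with respect to the ℓ_∞-norm. A toy sketch (problem and step schedule are illustrative, not from the paper):

```python
import numpy as np

def sign_step(w, grad, lr):
    """Sign gradient descent: each coordinate moves a fixed amount lr
    against the sign of its gradient (ℓ_∞ steepest descent)."""
    return w - lr * np.sign(grad)

# Separable quadratic f(w) = 0.5 * sum(a_i * w_i^2), so grad = a * w.
a = np.array([100.0, 1.0])        # 100:1 gap in per-coordinate curvature
w = np.array([0.7, -0.3])
for t in range(1, 200):
    # decaying steps, since sign steps never shrink on their own
    w = sign_step(w, a * w, lr=1.0 / t)
# both coordinates end up near 0 despite the curvature gap
```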

Learning to Accelerate by the Methods of Step-size Planning [article]

Hengshuai Yao
2022 arXiv   pre-print
Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation.  ...  The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation  ...  The meta-gradient approach of scalar step-size adaptation cannot converge faster than the linear rate for quadratic optimization as given in the above theorem.  ... 
arXiv:2204.01705v4 fatcat:mopclv3rgngn7gezi6fmtzi3je
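Among the reviewed methods, the Polyak step-size has the simplest closed form: alpha = (f(w) - f*) / ||grad f(w)||², assuming the optimal value f* is known. A sketch on a toy quadratic where f* = 0 (the setup is illustrative, not from the paper):

```python
import numpy as np

def polyak_step(f_w, f_star, g):
    """Polyak step size: (f(w) - f*) / ||g||^2, assuming f* is known."""
    return (f_w - f_star) / (g @ g)

A = np.diag([1.0, 10.0])
f = lambda w: 0.5 * w @ A @ w       # minimum value f* = 0 at the origin
w = np.array([1.0, 1.0])
for _ in range(500):
    g = A @ w
    w = w - polyak_step(f(w), 0.0, g) * g
# converges without any hand-tuned learning rate
```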

Online Regularized Nonlinear Acceleration [article]

Damien Scieur, Edouard Oyallon, Alexandre d'Aspremont, Francis Bach
2019 arXiv   pre-print
The new scheme provably improves the rate of convergence of fixed step gradient descent, and its empirical performance is comparable to that of quasi-Newton methods.  ...  Here, we adapt RNA to overcome these issues, so that our scheme can be used on fast algorithms such as gradient methods with momentum.  ...  Edouard Oyallon was partially supported by a postdoctoral grant from DPEI of Inria (AAR 2017POD057) for the collaboration with CWI.  ... 
arXiv:1805.09639v2 fatcat:xzysmjgsrjafvhovqwuymdk7ga
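The core of RNA-style schemes is to post-process plain fixed-step iterates: find affine weights (summing to one) that minimize a regularized norm of the residuals, then combine the stored iterates with those weights. A minimal sketch on a quadratic, where the extrapolation is essentially exact (regularization constant and problem sizes are illustrative):

```python
import numpy as np

def rna_extrapolate(X, R, lam=1e-10):
    """Regularized nonlinear acceleration: weights c with sum(c) = 1
    minimizing ||R c||^2 + lam ||c||^2, applied to the stored iterates."""
    k = R.shape[1]
    M = R.T @ R
    M = M / np.linalg.norm(M) + lam * np.eye(k)   # normalize, then regularize
    z = np.linalg.solve(M, np.ones(k))
    c = z / z.sum()
    return X[:, :k] @ c

# Fixed-step gradient descent on f(w) = 0.5 wᵀAw - bᵀw.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
w_star = np.linalg.solve(A, b)

iters = [np.zeros(2)]
for _ in range(3):
    iters.append(iters[-1] - 0.2 * (A @ iters[-1] - b))
X = np.column_stack(iters)       # iterates x_0 .. x_3
R = np.diff(X, axis=1)           # residuals r_i = x_{i+1} - x_i
w_acc = rna_extrapolate(X, R)
# w_acc lands far closer to w_star than the last gradient iterate
```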

Meta-descent for Online, Continual Prediction [article]

Andrew Jacobsen, Matthew Schlegel, Cameron Linke, Thomas Degris, Adam White, Martha White
2019 arXiv   pre-print
Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes.  ...  Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error.  ...  To minimize (4), we use stochastic gradient descent, and thus need to compute the gradient of ‖Δ_t(w_t(α))‖₂² w.r.t. the step-size α.  ... 
arXiv:1907.07751v2 fatcat:6gvgs2klsbadrlpr3szgelilfe
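The quantity in the snippet — the gradient of the error with respect to the step-size itself — is the basis of hypergradient-style methods: for plain GD, since w_t = w_{t-1} - α g_{t-1}, the derivative of the loss w.r.t. α is -g_t · g_{t-1}, so the step size grows while successive gradients agree. A sketch of a normalized multiplicative variant (constants are illustrative, and this is not the paper's specific meta-descent algorithm):

```python
import numpy as np

# Hypergradient-style step-size adaptation on a toy quadratic: alpha is
# nudged up while consecutive gradients point the same way, down otherwise.
A = np.diag([1.0, 10.0])
f = lambda w: 0.5 * w @ A @ w
w = np.array([1.0, 1.0])
alpha, beta = 0.01, 0.005      # deliberately small initial step; meta step size
g_prev = None
for _ in range(100):
    g = A @ w
    if g_prev is not None:
        cos = (g @ g_prev) / (np.linalg.norm(g) * np.linalg.norm(g_prev))
        alpha *= 1.0 + beta * cos      # normalized multiplicative update
    w = w - alpha * g
    g_prev = g
# alpha has grown past its initial value and the loss has dropped
```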
Showing results 1 — 15 out of 9,843 results