Escaping Saddles with Stochastic Gradients
[article] · 2018 · arXiv · pre-print
Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully ...
We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these ...
Experiments. In this section we first show that vanilla SGD (Algorithm 2) as well as GD with a stochastic gradient step as perturbation (Algorithm 1) indeed escape saddle points. ...
arXiv:1803.05999v2
fatcat:ww74scxrovbgnpskbyys7g4veq
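For context, a minimal sketch of the kind of scheme this entry describes: plain gradient descent that, near a first-order stationary point, takes a single stochastic-gradient step as its perturbation. The trigger rule, step sizes, and function names (`grad`, `stoch_grad`) are illustrative assumptions, not the paper's Algorithm 1.

```python
import numpy as np

def gd_with_sgd_perturbation(grad, stoch_grad, x0, eta=0.1, eps=1e-3,
                             n_iters=1000, rng=None):
    """Gradient descent that, whenever the full gradient is small (a
    possible saddle), takes one stochastic-gradient step as its
    perturbation. The trigger rule and step sizes are illustrative
    assumptions, not the paper's Algorithm 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            # Near a first-order stationary point: perturb with a single
            # stochastic gradient; its component along negative-curvature
            # directions is what the paper's analysis relies on for escape.
            x = x - eta * stoch_grad(x, rng)
        else:
            x = x - eta * g
    return x
```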
Escaping Saddle Points with Stochastically Controlled Stochastic Gradient Methods
[article] · 2021 · arXiv · pre-print
Stochastically controlled stochastic gradient (SCSG) methods have been proved to converge efficiently to first-order stationary points which, however, can be saddle points in nonconvex optimization. ...
Simulation studies illustrate that the proposed algorithm can escape saddle points in much fewer epochs than the gradient descent methods perturbed by either noise injection or a SGD step. ...
Stochastic gradient descent escapes saddle points efficiently. arXiv preprint arXiv:1902.04811, 2019.
Rie Johnson and Tong Zhang. ...
arXiv:2103.04413v3
fatcat:7xfbarbp5zam5e6gfahrtng2aq
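As a rough illustration of the SCSG template mentioned in this entry, the sketch below runs SVRG-style variance-reduced inner steps anchored on a large sampled batch, with a geometrically distributed inner-loop length. The interface (`component_grad(x, idx)` returning the average gradient over the indexed components), batch sizes, and step size are assumptions for illustration; the cited method adds further ingredients for saddle-point escape that are not reproduced here.

```python
import numpy as np

def scsg(component_grad, n, x0, eta=0.05, big_batch=64, mini_batch=8,
         n_epochs=50, rng=None):
    """Sketch of an SCSG-style epoch: anchor a variance-reduced update on a
    large sampled batch, then run a geometrically distributed number of
    SVRG-style inner steps. component_grad(x, idx) is assumed to return the
    average gradient of the components indexed by idx at x; all batch sizes
    and step sizes are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        anchor_idx = rng.choice(n, size=big_batch, replace=False)
        g_anchor = component_grad(x_tilde, anchor_idx)
        # Geometric inner-loop length keeps the expected per-epoch cost
        # proportional to the anchor batch size.
        n_inner = int(rng.geometric(mini_batch / (big_batch + mini_batch)))
        x = x_tilde.copy()
        for _ in range(n_inner):
            idx = rng.choice(n, size=mini_batch, replace=False)
            v = component_grad(x, idx) - component_grad(x_tilde, idx) + g_anchor
            x = x - eta * v
        x_tilde = x
    return x_tilde
```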
Adaptive Stochastic Gradient Langevin Dynamics: Taming Convergence and Saddle Point Escape Time
[article] · 2018 · arXiv · pre-print
All proposed algorithms can escape from saddle points with at most O(log d) iterations, which is nearly dimension-free. ...
In this paper, we propose a new adaptive stochastic gradient Langevin dynamics (ASGLD) algorithmic framework and its two specialized versions, namely adaptive stochastic gradient (ASG) and adaptive gradient ...
full gradient information in identifying escape direction, which implies that they do not work with cases where only stochastic gradients are available. ...
arXiv:1805.09416v1
fatcat:5qbqlfpfnnfkth3ogmlwzahw4a
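This entry builds on stochastic gradient Langevin dynamics; below is a minimal sketch of the vanilla SGLD update (a stochastic-gradient step plus isotropic Gaussian noise whose scale is tied to the step size). The adaptive scalings that define ASGLD/ASG in the paper are not reproduced; `stoch_grad`, `eta`, and `beta` are illustrative names and values.

```python
import numpy as np

def sgld(stoch_grad, x0, eta=1e-3, beta=10.0, n_iters=1000, rng=None):
    """Minimal sketch of a vanilla stochastic gradient Langevin dynamics
    update: stochastic-gradient step plus isotropic Gaussian noise scaled
    by sqrt(2*eta/beta), with beta an inverse-temperature parameter. The
    adaptive variants in the entry above adjust these scales with
    accumulated gradient information; that is not reproduced here."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = stoch_grad(x, rng)
        noise = rng.standard_normal(x.shape)
        x = x - eta * g + np.sqrt(2.0 * eta / beta) * noise
    return x
```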
Linear Speedup in Saddle-Point Escape for Decentralized Non-Convex Optimization
[article] · 2019 · arXiv · pre-print
We establish linear speedup in saddle-point escape time in the number of agents for symmetric combination policies and study the potential for further improvement by employing asymmetric combination weights ...
Under appropriate cooperation protocols and parameter choices, fully decentralized solutions for stochastic optimization have been shown to match the performance of centralized solutions and result in ...
Recently these results have been extended to decentralized optimization with deterministic gradients and random initialization [33] as well as stochastic gradients with diminishing step-size and decaying ...
arXiv:1910.13852v1
fatcat:ohvqqkggoneltf4kebtq5qex5q
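A minimal adapt-then-combine sketch of the decentralized setting in this entry: each agent takes a local stochastic-gradient step and then averages with its neighbors through a combination matrix. Using a symmetric, doubly stochastic `A` corresponds to the "symmetric combination policies" mentioned above; the interface and constants are illustrative assumptions.

```python
import numpy as np

def decentralized_sgd(stoch_grads, x0, A, mu=0.01, n_iters=500, rng=None):
    """Adapt-then-combine diffusion sketch: every agent takes a local
    stochastic-gradient step, then forms a convex combination of its
    neighbors' intermediate iterates via the row-stochastic combination
    matrix A. stoch_grads(X, rng) is assumed to return an array of shape
    (n_agents, dim) with one stochastic gradient per agent."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(x0, dtype=float)           # shape (n_agents, dim)
    A = np.asarray(A, dtype=float)
    for _ in range(n_iters):
        psi = X - mu * stoch_grads(X, rng)  # local adaptation step
        X = A @ psi                         # combine with neighbors
    return X
```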
A Realistic Example in 2 Dimension that Gradient Descent Takes Exponential Time to Escape Saddle Points
[article] · 2020 · arXiv · pre-print
In this paper we show a negative result: gradient descent may take exponential time to escape saddle points, with non-pathological two dimensional functions. ...
Through our analysis we demonstrate that stochasticity is essential to escape saddle points efficiently. ...
In fact, it is fairly easy to construct some function with an artificial initialization scheme, such that gradient descent will take an exponential number of iterates to escape a saddle point. ...
arXiv:2008.07513v1
fatcat:o2wndqztqzbxtnk7mflrdr3you
Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia
[article] · 2021 · arXiv · pre-print
We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. ...
Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and minima selection. ...
Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pp. 1155-1164, 2018.
Dauphin, Y. ...
arXiv:2006.15815v9
fatcat:gvbgk7wtd5dvxlpn4vcm2fgg64
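To make the two effects discussed in this entry concrete, here is the standard Adam update with its momentum (first-moment) and adaptive-learning-rate (second-moment) components commented separately. This is plain Adam for reference only, not the Adai method proposed in the paper.

```python
import numpy as np

def adam_step(x, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update with the two effects labeled: the first
    moment m carries the momentum (inertia), while the second moment v
    supplies the coordinate-wise adaptive learning rate. t is the 1-based
    iteration count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g          # momentum / inertia component
    v = beta2 * v + (1 - beta2) * g**2       # adaptive learning-rate scale
    m_hat = m / (1 - beta1**t)               # bias corrections
    v_hat = v / (1 - beta2**t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```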
Sharp Analysis for Nonconvex SGD Escaping from Saddle Points
[article] · 2019 · arXiv · pre-print
In this paper, we give a sharp analysis for Stochastic Gradient Descent (SGD) and prove that SGD is able to efficiently escape from saddle points and find an (ϵ, O(ϵ^0.5))-approximate second-order stationary point in Õ(ϵ^-3.5) stochastic gradient computations for generic nonconvex optimization problems, when the objective function satisfies gradient-Lipschitz, Hessian-Lipschitz, and dispersive noise assumptions ...
Acknowledgement. The authors would like to greatly thank Chris Junchi Li for providing us with a proof of SGD escaping saddle points in Õ(ϵ^-4) computational cost and for carefully revising our paper. ...
arXiv:1902.00247v2
fatcat:q2olwny57revbl5z7vytcn5gfq
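For reference, the (ϵ, O(ϵ^0.5))-approximate second-order stationarity used in snippets like this one is usually taken to mean the following (a standard definition; under ρ-Hessian-Lipschitzness the curvature tolerance δ is typically set to √(ρϵ)):

```latex
\[
  \|\nabla f(x)\| \le \epsilon
  \qquad\text{and}\qquad
  \lambda_{\min}\!\bigl(\nabla^2 f(x)\bigr) \ge -\delta,
  \qquad \delta = O(\sqrt{\epsilon}).
\]
```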
Saving Gradient and Negative Curvature Computations: Finding Local Minima More Efficiently
[article] · 2017 · arXiv · pre-print
We propose a family of nonconvex optimization algorithms that are able to save gradient and negative curvature computations to a large extent, and are guaranteed to find an approximate local minimum with ...
Our novel analysis shows that the proposed algorithms can escape the small gradient region in only one negative curvature descent step whenever they enter it, and thus they only need to perform at most ...
As we will prove later in this section, one can use Algorithm 1 to escape a saddle point x with λ_min(∇²f(x)) < −ϵ_H. ...
arXiv:1712.03950v1
fatcat:nv3yjvunxjbczn4xw6hyrx42lu
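A hedged sketch of what one negative-curvature descent step can look like: estimate the eigenvector of the Hessian with the most negative eigenvalue and move along it. The shifted power iteration, the dense Hessian, the random sign choice, and all constants are illustrative; the algorithms in the entry above use more refined (and Hessian-free) routines.

```python
import numpy as np

def negative_curvature_step(hess, x, step=0.1, shift=10.0, n_power=50,
                            rng=None):
    """Sketch of one negative-curvature descent step: approximate the
    eigenvector of the Hessian with the most negative eigenvalue by power
    iteration on shift*I - H (assuming shift >= ||H||), then move along it.
    All constants are illustrative choices."""
    rng = np.random.default_rng() if rng is None else rng
    H = np.asarray(hess(x), dtype=float)
    M = shift * np.eye(H.shape[0]) - H       # top eigenvector of M matches
    v = rng.standard_normal(H.shape[0])      # the bottom eigenvector of H
    for _ in range(n_power):
        v = M @ v
        v /= np.linalg.norm(v)
    lam = v @ H @ v                          # Rayleigh quotient ~ lambda_min
    if lam >= 0:
        return x                             # no usable negative curvature
    # With a negligible gradient, the sign of the step can be chosen at
    # random (or by the sign of <grad, v> when a gradient is available).
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return x + sign * step * v
```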
Gradient Descent Can Take Exponential Time to Escape Saddle Points
[article] · 2017 · arXiv · pre-print
Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions ...
On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points - it can find an approximate local minimizer in polynomial time. ...
In particular, we expect that with random initialization, general stochastic gradient descent will need exponential time to escape saddle points in the worst case. ...
arXiv:1705.10412v2
fatcat:ekjhh5ccrnhwncju7gmdrrj2gu
On the diffusion approximation of nonconvex stochastic gradient descent
[article] · 2018 · arXiv · pre-print
We study the Stochastic Gradient Descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. ...
saddle point): it escapes in a number of iterations exponentially (resp. almost linearly) dependent on the inverse stepsize. ...
A. On weak approximation of diffusion process to stochastic gradient descent. In the main text, we have considered the stochastic gradient descent (SGD) iteration x^(k) = x^(k−1) − η ∇f(x^(k−1), ζ_k), ...
arXiv:1705.07562v2
fatcat:wvsq22vh6rgjzpt76aiqzxpv6q
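The diffusion approximation referred to in this entry typically replaces the quoted SGD iteration with a stochastic differential equation of the following form (a commonly used weak approximation as the step size η → 0; Σ(x) denotes the covariance of the stochastic gradient):

```latex
\[
  dX_t = -\nabla F(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t,
  \qquad F(x) = \mathbb{E}_\zeta\, f(x,\zeta),
  \quad \Sigma(x) = \operatorname{Cov}_\zeta\!\bigl(\nabla f(x,\zeta)\bigr).
\]
```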
A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima
[article] · 2021 · arXiv · pre-print
more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. ...
Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. ...
Stochastic Gradient Noise Analysis. ...
arXiv:2002.03495v14
fatcat:tbavmri37jciziarjz2ybnnt5m
Escaping Saddle-Points Faster under Interpolation-like Conditions
[article] · 2020 · arXiv · pre-print
In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. ...
We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of Perturbed Stochastic Gradient Descent ( ...
But then it is shown that the stuck region is narrow enough so that the iterates escape the saddle points with high probability. We now require a condition on the tail of the stochastic gradient. ...
arXiv:2009.13016v1
fatcat:z5j64p3jxvamdipaaeqptkclfi
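A minimal sketch of perturbed SGD in the spirit of this entry: ordinary SGD plus an occasional uniform draw from a small ball whenever the gradient estimate is small, so the iterate can leave the narrow "stuck region" around a saddle. The threshold, radius, and cooldown are illustrative assumptions, not the parameters analyzed in the paper.

```python
import numpy as np

def perturbed_sgd(stoch_grad, x0, eta=0.05, g_thresh=1e-3, radius=1e-2,
                  cooldown=50, n_iters=2000, rng=None):
    """Sketch of perturbed SGD: ordinary SGD plus an occasional uniform
    perturbation from a small ball whenever the gradient estimate is small.
    Thresholds, radius, and cooldown are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    last_perturb = -cooldown
    for t in range(n_iters):
        g = stoch_grad(x, rng)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= cooldown:
            # Uniform draw from a ball: random direction, with the radius
            # scaled by U^(1/d) so the point is uniform in volume.
            u = rng.standard_normal(x.shape)
            u *= radius * rng.random() ** (1.0 / x.size) / np.linalg.norm(u)
            x = x + u
            last_perturb = t
        x = x - eta * g
    return x
```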
Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization
[article] · 2020 · arXiv · pre-print
A key insight in these analyses is that gradient perturbations play a critical role in allowing local descent algorithms to efficiently distinguish desirable from undesirable stationary points and escape ...
descent and its variations, perform well in converging towards local minima and avoiding saddle-points. ...
The authors are with the Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne. Emails: {stefan.vlaski, ali.sayed}@epfl.ch. ...
arXiv:2003.14366v1
fatcat:42vsyhewprcaln2j7365ehb4zi
On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points
[article] · 2019 · arXiv · pre-print
Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. ...
But these analyses do not take into account the possibility of converging to saddle points. ...
Stochastic setting with Lipschitz gradient. ...
arXiv:1902.04811v2
fatcat:rmdh2zan2vhdxbbzi6if2krnwe
Second-Order Guarantees in Federated Learning
[article] · 2020 · arXiv · pre-print
We draw on recent results on the second-order optimality of stochastic gradient algorithms in centralized and decentralized settings, and establish second-order guarantees for a class of federated learning ...
Nevertheless, most existing analyses are either limited to convex loss functions, or only establish first-order stationarity, despite the fact that saddle-points, which are first-order stationary, are ...
All implementations escape the saddle-point and find a local minimum. ...
arXiv:2012.01474v1
fatcat:eyxwyialxbcg7dtmywyhnyccgu
Showing results 1 — 15 out of 2,638 results