On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Dongruo Zhou and Yiqi Tang and Ziyan Yang and Yuan Cao and Quanquan Gu
2018 arXiv preprint
Adaptive gradient methods are workhorses in deep learning. However, their convergence guarantees for nonconvex optimization have not been sufficiently studied. In this paper, we provide a sharp analysis of a recently proposed adaptive gradient method, the partially adaptive momentum estimation method (Padam) (Chen and Gu, 2018), which admits many existing adaptive gradient methods, such as RMSProp and AMSGrad, as special cases. Our analysis shows that, for smooth nonconvex functions, Padam converges to a first-order stationary point at the rate of O((∑_{i=1}^d ‖g_{1:T,i}‖_2)^{1/2}/T^{3/4} + d/T), where T is the number of iterations, d is the dimension, g_1, ..., g_T are the stochastic gradients, and g_{1:T,i} = [g_{1,i}, g_{2,i}, ..., g_{T,i}]^⊤. Our theoretical result also suggests that in order to achieve a faster convergence rate, it is necessary to use Padam instead of AMSGrad. This is well aligned with the empirical results for deep learning reported in Chen and Gu (2018).
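To make the "partially adaptive" idea concrete, the following is a minimal NumPy sketch of one Padam update as described in Chen and Gu (2018): an AMSGrad-style running maximum of the second-moment estimate, but with the denominator raised to a partial adaptivity exponent p ∈ (0, 1/2]. The hyperparameter values (lr, beta1, beta2, p, eps) and the function name `padam_step` are illustrative assumptions, not taken from the paper; p = 1/2 recovers AMSGrad, and fully non-adaptive SGD with momentum corresponds to p = 0.

```python
import numpy as np

def padam_step(theta, grad, m, v, v_hat,
               lr=0.05, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One Padam update (sketch). Hyperparameters are illustrative.

    m     : first-moment (momentum) estimate
    v     : exponential moving average of squared gradients
    v_hat : elementwise running maximum of v (AMSGrad-style)
    p     : partial adaptivity exponent in (0, 1/2]; p = 1/2 gives AMSGrad
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    v_hat = np.maximum(v_hat, v)                # keep denominator non-decreasing
    theta = theta - lr * m / (v_hat ** p + eps) # partially adaptive step
    return theta, m, v, v_hat

# Usage sketch: minimize f(x) = x^2 from x = 1 for a few hundred steps.
theta = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
v_hat = np.zeros(1)
for _ in range(200):
    grad = 2.0 * theta                          # gradient of x^2
    theta, m, v, v_hat = padam_step(theta, grad, m, v, v_hat)
```

Taking p strictly below 1/2 damps the per-coordinate adaptivity of the step size, which is what the convergence bound above exploits relative to AMSGrad.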
arXiv:1808.05671v2