73,540 Hits in 3.4 sec

On the Convergence of Adam and Beyond [article]

Sashank J. Reddi, Satyen Kale, Sanjiv Kumar
2019 arXiv   pre-print
We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm  ...  Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients, and propose new variants of the Adam algorithm which not only fix the  ...  Figure 1: Performance comparison of ADAM and AMSGRAD on a synthetic example: a simple one-dimensional convex problem inspired by our examples of non-convergence.  ... 
arXiv:1904.09237v1 fatcat:ctg52u4p5fgufdfxrgwmvydvlu
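
The 'long-term memory' fix mentioned in the snippet corresponds, in the AMSGrad variant proposed in this paper, to keeping a running maximum of the second-moment estimate so the effective learning rate can never grow. A minimal NumPy sketch of a single update (bias correction omitted, hyperparameter defaults chosen for illustration):

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_hat, lr=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: like Adam, except the denominator uses the
    running maximum of the second-moment estimate, which acts as the
    'long-term memory' of past gradients."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment
    v_hat = np.maximum(v_hat, v)                  # never let the denominator shrink
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```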

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed [article]

Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He
2021 arXiv   pre-print
They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT.  ...  In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5×, offers much better scalability, and provides the same convergence speed as uncompressed Adam.  ...  We present a theoretical analysis of the convergence of 1-bit Adam, and show that it admits the same asymptotic convergence rate as the uncompressed one. • We conduct experiments on large-scale ML tasks that  ... 
arXiv:2102.02888v2 fatcat:ows54oozbjctbb2dezctowxldq
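
The communication savings described here come from compressing the momentum to one bit per coordinate (its sign, plus a single scale) with error feedback, after a full-precision warmup during which Adam's variance term is fixed. A rough sketch of the error-compensated compression step only; the function name and scaling choice are illustrative, not the paper's exact implementation:

```python
import numpy as np

def one_bit_compress(x, error):
    """Error-compensated 1-bit compression: transmit only the sign of the
    (error-corrected) momentum and a single scalar scale, and carry the
    quantization error over to the next communication round."""
    corrected = x + error                     # add back the residual from last round
    scale = np.mean(np.abs(corrected))        # one scalar per tensor
    compressed = scale * np.sign(corrected)   # 1 bit per coordinate + the scale
    new_error = corrected - compressed        # residual to compensate next round
    return compressed, new_error
```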

On the Variance of the Adaptive Learning Rate and Beyond [article]

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han
2021 arXiv   pre-print
and Adam.  ...  We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate.  ...  W911NF-17-C-0099 and FA8750-19-2-1004, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and DTRA HDTRA11810026.  ... 
arXiv:1908.03265v4 fatcat:kqa3woty3rd4lhshba3fjiuxuq
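
The rectification term mentioned above switches the adaptive learning rate off while its variance estimate is still unreliable (early in training) and scales it afterwards. A small sketch of the rectification factor as I understand it from the published algorithm; treat the constants and threshold as an approximation of the paper's formula:

```python
import math

def radam_rectification(t, beta2=0.999):
    """Variance rectification factor used by RAdam at step t >= 1.
    Returns None while the approximated SMA length rho_t is too small,
    signalling a fall-back to an un-adapted (SGD-with-momentum) step."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                     ((rho_inf - 4) * (rho_inf - 2) * rho_t))
```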

Table errata: "On comparing Adams and natural spline multistep formulas" (Math. Comp. 29 (1975), 741–745)

David R. Hill
1976 Mathematics of Computation  
Unfortunately, the only copy of Gosper et al. is not a very good one, the exact number of terms it contains is not even known, and no statistics were compiled concerning it.  ...  The statistical table in the UMT is likewise only approximately true and the fraction beyond 19945 terms is false.  ... 
doi:10.1090/s0025-5718-1976-0386214-2 fatcat:as35qed6szhqrk4lneumqkjpwy

Avoiding local minima in variational quantum eigensolvers with the natural gradient optimizer [article]

David Wierichs, Christian Gogolin, Michael Kastoryano
2020 arXiv   pre-print
The BFGS algorithm is frequently unable to find a global minimum for systems beyond about 20 spins and ADAM easily gets trapped in local minima.  ...  We compare the BFGS optimizer, ADAM and Natural Gradient Descent (NatGrad) in the context of Variational Quantum Eigensolvers (VQEs).  ...  (a) The threshold size beyond which ADAM fails can be shifted by reducing η, delaying local convergence to bigger systems.  ... 
arXiv:2004.14666v2 fatcat:noxvfl5idra2rlgsdpdrbclehq
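
For context, the Natural Gradient Descent optimizer compared here preconditions the gradient with the (regularized) inverse of a metric tensor; in the VQE setting that metric is the Fubini-Study / quantum Fisher information matrix. A generic sketch, with the metric simply passed in as an array and the regularization constant chosen for illustration:

```python
import numpy as np

def natural_gradient_step(theta, grad, metric, lr=0.05, reg=1e-4):
    """One natural-gradient update: solve (metric + reg*I) @ step = grad
    instead of inverting the metric explicitly, then move against 'step'."""
    g = metric + reg * np.eye(len(theta))   # Tikhonov regularization for stability
    step = np.linalg.solve(g, grad)
    return theta - lr * step
```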

Understanding and Scheduling Weight Decay [article]

Zeke Xie, Issei Sato, Masashi Sugiyama
2021 arXiv   pre-print
Previous work usually interpreted weight decay as a Gaussian prior from the Bayesian perspective. However, weight decay sometimes shows mysterious behaviors beyond the conventional understanding.  ...  First, we propose a novel theoretical interpretation of weight decay from the perspective of learning dynamics.  ...  ACKNOWLEDGEMENT MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.  ... 
arXiv:2011.11152v4 fatcat:gbuwvxetvnbb5cpa6hfpwsh34u
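
The background distinction at play here is between L2 regularization (decay folded into the gradient, and therefore rescaled by any adaptive preconditioner) and decoupled weight decay (parameters shrunk directly). A minimal sketch of the two for plain SGD; this illustrates the standard contrast, not this paper's specific scheduling proposal:

```python
def sgd_l2_step(theta, grad, lr=0.1, wd=1e-4):
    """L2 regularization: the decay term enters through the gradient."""
    return theta - lr * (grad + wd * theta)

def sgd_decoupled_wd_step(theta, grad, lr=0.1, wd=1e-4):
    """Decoupled weight decay: shrink the parameters directly,
    independently of the gradient-based part of the update."""
    return (1.0 - lr * wd) * theta - lr * grad
```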

On the adequacy of untuned warmup for adaptive optimization [article]

Jerry Ma, Denis Yarats
2021 arXiv   pre-print
") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup.  ...  In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability.  ...  RAdam's claimed benefits are its superior performance to Adam and its elimination of costly warmup schedule tuning.  ... 
arXiv:1910.04209v3 fatcat:muqqno55kjfuvakhj3p33tl26u
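
The 'untuned' warmup referred to in the snippet is, as I read the paper, a linear ramp whose length is derived from beta2 rather than tuned, roughly 2 / (1 - beta2) steps; treat the exact constant as an assumption. A small sketch of such a schedule:

```python
def untuned_linear_warmup(step, beta2=0.999):
    """Learning-rate multiplier for an untuned linear warmup: ramp from 0
    to 1 over roughly 2 / (1 - beta2) steps (about 2000 for beta2=0.999)."""
    warmup_steps = 2.0 / (1.0 - beta2)
    return min(1.0, step / warmup_steps)
```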

Table Errata

1976 Mathematics of Computation  
Unfortunately, the only copy of Gosper et al. is not a very good one, the exact number of terms it contains is not even known, and no statistics were compiled concerning it.  ...  Therefore, the continued fraction must be correct to far beyond the result of Choong et al.  ... 
doi:10.1090/s0025-5718-76-99666-6 fatcat:oohirkouvjbgbazemzdux53czi

On Higher-order Moments in Adam [article]

Zhanhong Jiang, Aditya Balu, Sin Yong Tan, Young M Lee, Chinmay Hegde, Soumik Sarkar
2019 arXiv   pre-print
Our analysis and experiments reveal that certain higher-order moments of the stochastic gradient are able to achieve better performance compared to the vanilla Adam algorithm.  ...  In this paper, we investigate the popular deep learning optimization routine, Adam, from the perspective of statistical moments.  ...  There has been a large body of work in the literature that investigates the convergence of Adam as well as how adaptive learning rate benefits the convergence.  ... 
arXiv:1910.06878v1 fatcat:rsctqrgforclznkyh2ppq7kkri
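
One natural way to read 'higher-order moments' is to replace Adam's second raw moment with a p-th absolute moment of the gradient and take the matching p-th root in the denominator; the sketch below shows that generalization (p = 2 recovers Adam) and is only an illustration, not necessarily the exact variant analyzed in the paper:

```python
import numpy as np

def higher_moment_adam_step(theta, grad, m, v, t, p=3, lr=1e-3,
                            beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-like update with a p-th absolute moment in place of the
    second raw moment; p=2 recovers vanilla Adam (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * np.abs(grad) ** p
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** (1.0 / p) + eps)
    return theta, m, v
```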

AdaSGD: Bridging the gap between SGD and Adam [article]

Jiaxuan Wang, Jenna Wiens
2020 arXiv   pre-print
On several datasets that span three different domains, we demonstrate how AdaSGD combines the benefits of both SGD and Adam, eliminating the need for approaches that transition from Adam to SGD.  ...  In the context of stochastic gradient descent (SGD) and adaptive moment estimation (Adam), researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving  ...  On the convergence of Adam and beyond. ICLR, 2018. Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016. Tieleman, T. and Hinton, G.  ... 
arXiv:2006.16541v1 fatcat:hd2jkgswlzbiffvd4t5xvf7fqy
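
As I understand the idea, AdaSGD bridges the two methods by replacing Adam's per-parameter adaptive denominator with a single scalar shared across all coordinates; the sketch below illustrates that reading and should not be taken as the paper's exact algorithm:

```python
import numpy as np

def adasgd_like_step(theta, grad, m, v, t, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-like update with a single shared second-moment scalar v, so
    all coordinates are rescaled by the same factor (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * float(np.mean(grad ** 2))   # scalar, not per-parameter
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```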

Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile [article]

Panayotis Mertikopoulos and Bruno Lecouat and Houssam Zenati and Chuan-Sheng Foo and Vijay Chandrasekhar and Georgios Piliouras
2018 arXiv   pre-print
Our analysis generalizes and extends the results of Daskalakis et al. (2018) for optimistic gradient descent (OGD) in bilinear problems, and makes concrete headway for establishing convergence beyond convex-concave  ...  We also provide stochastic analogues of these results, and we validate our analysis by numerical experiments in a wide array of GAN models (including Gaussian mixture models, as well as the CelebA and  ...  For ease of comparison, we provide below a collection of samples generated by Adam and optimistic Adam in the CelebA and CIFAR-datasets.  ... 
arXiv:1807.02629v2 fatcat:va6lywidcfh5xojmoqjgo4nzry
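
The 'extra (gradient)' step in the title refers to the classical extra-gradient scheme for saddle-point problems: take a look-ahead step, then update the original iterates using the gradients evaluated at the look-ahead point. A minimal sketch for min_x max_y f(x, y), with grad_x and grad_y passed in as callables:

```python
def extragradient_step(x, y, grad_x, grad_y, lr=0.1):
    """One extra-gradient step for min_x max_y f(x, y):
    descend in x, ascend in y, using gradients at a look-ahead point."""
    x_half = x - lr * grad_x(x, y)               # extrapolation (look-ahead)
    y_half = y + lr * grad_y(x, y)
    x_new = x - lr * grad_x(x_half, y_half)      # update from look-ahead gradients
    y_new = y + lr * grad_y(x_half, y_half)
    return x_new, y_new
```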

Algorithms for solving optimization problems arising from deep neural net models: nonsmooth problems [article]

Vyacheslav Kungurtsev, Tomas Pevny
2018 arXiv   pre-print
In this paper, we summarize the primary challenges involved, the state of the art, and present some numerical results on an interesting and representative class of problems.  ...  This alone presents a challenge to the application and development of appropriate optimization algorithms for solving the problem.  ...  Figure 1: Plot comparing the final value after 1000 iterations of ADAM, SFO, and LMBM. Figure 2: Plot comparing the time to convergence of ADAM, SFO, and LMBM.  ... 
arXiv:1807.00173v1 fatcat:cwf5w2kigfchjepw4m7epsjsxy

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm [article]

Hanlin Tang, Shaoduo Gan, Samyam Rajbhandari, Xiangru Lian, Ji Liu, Yuxiong He, Ce Zhang
2020 arXiv   pre-print
We also conduct a theoretical analysis of the convergence and efficiency.  ...  The proposed algorithm achieves a similar convergence efficiency to Adam in terms of epochs, but significantly reduces the running time per epoch.  ...  Beyond Adam, many other strategies that share the same idea of changing the learning rate dynamically have been studied.  ... 
arXiv:2008.11343v2 fatcat:664c57kfnvg7bi2ba6do55z6zm
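
The 'Adam-preconditioned momentum SGD' in the title suggests estimating Adam's variance term during a short warmup, freezing it, and then running (compressible) momentum SGD preconditioned by that frozen term; the sketch below reflects that reading, with names and defaults chosen for illustration:

```python
import numpy as np

def preconditioned_momentum_step(theta, grad, m, v_frozen, lr=1e-3,
                                 beta1=0.9, eps=1e-8):
    """Momentum SGD preconditioned by a frozen Adam-style variance term
    v_frozen (estimated once during warmup and then held fixed), so only
    the momentum needs to be communicated and compressed."""
    m = beta1 * m + (1 - beta1) * grad
    theta = theta - lr * m / (np.sqrt(v_frozen) + eps)
    return theta, m
```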

Accelerating Least Squares Imaging Using Deep Learning Techniques [article]

Janaki Vamaraju, Jeremy Vila, Mauricio Araya-Polo, Debanjan Datta, Mohamed Sidahmed, Mrinal Sen
2019 arXiv   pre-print
The success of the inversion largely depends on our ability to handle large systems of equations given the massive computation costs.  ...  Further, minimizing the Huber loss with mini-batch gradients and Adam optimizer is not only less memory-intensive but is also more robust.  ...  Adam and AdaBound converge within 6 epochs and the cost function becomes flat beyond 10 epochs. However, the image from Adam (MSSIM=0.8255) has a higher MSSIM than AdaBound (MSSIM = 0.8157).  ... 
arXiv:1911.06027v2 fatcat:o4sibaygmfhdzf4lhecddrjraq
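
For reference, the Huber loss mentioned in the snippet is quadratic for small residuals and linear for large ones, which is what makes it more robust than plain least squares. A standard elementwise definition (delta is the transition point):

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Elementwise Huber loss: 0.5*r^2 for |r| <= delta,
    delta*(|r| - 0.5*delta) otherwise."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual ** 2,
                    delta * (abs_r - 0.5 * delta))
```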

Table errata: "Regular continued fractions for π and γ" (Math. Comp. 25 (1971), 403); "Rational approximations to π" (ibid. 25 (1971), 387–392) by K. Y. Choong, D. E. Daykin and C. R. Rathbone

D. Shanks
1976 Mathematics of Computation  
Unfortunately, the only copy of Gosper et al. is not a very good one, the exact number of terms it contains is not even known, and no statistics were compiled concerning it.  ...  The statistical table in the UMT is likewise only approximately true and the fraction beyond 19945 terms is false.  ... 
doi:10.1090/s0025-5718-1976-0386215-4 fatcat:sl5yfhd4ercj3oen24fixf6pju
Showing results 1 — 15 out of 73,540 results