2,191 Hits in 5.4 sec

Improving Generalization Performance by Switching from Adam to SGD [article]

Nitish Shirish Keskar, Richard Socher
2017 arXiv   pre-print
Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied.  ...  These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training.  ...  To investigate this further, we propose SWATS, a simple strategy that combines the best of both worlds by Switching from Adam to SGD.  ... 
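The Adam-then-SGD idea can be sketched in a few lines of plain Python. Note the trigger used here (a fixed step count `switch_at`) is a placeholder assumption for illustration; the actual SWATS criterion monitors the projection of the Adam step onto the gradient and switches when its exponential average stabilizes.

```python
import math

def train_with_switch(grad_fn, x0, steps=200, switch_at=100,
                      adam_lr=0.1, sgd_lr=0.05,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam for the first `switch_at` steps, then plain SGD."""
    x = float(x0)
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        if t <= switch_at:
            # Adam phase: adaptive steps for fast early progress.
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            x -= adam_lr * m_hat / (math.sqrt(v_hat) + eps)
        else:
            # SGD phase: plain gradient steps for the later stage.
            x -= sgd_lr * g
    return x

# Toy problem: minimize f(x) = x^2, whose gradient is 2x.
x_final = train_with_switch(lambda x: 2.0 * x, x0=5.0)
```

On this convex toy problem both phases converge; the paper's point is that on deep networks the SGD phase recovers the generalization that a pure-Adam run tends to lose.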
arXiv:1712.07628v1 fatcat:uksgec7lfnfbjpcjbuz5e2l3mu

An Optimization Strategy Based on Hybrid Algorithm of Adam and SGD

Yijun Wang, Pengyu Zhou, Wenya Zhong, Yansong Wang
2018 MATEC Web of Conferences  
(Keskar et al., 2017) proposed a hybrid strategy to start training with Adam and switch to SGD at the right time.  ...  Abstract: Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD).  ...  Acknowledgment Note: this paper is supported by the attachment project, project name such as attachment.  ... 
doi:10.1051/matecconf/201823203007 fatcat:7xo3q7dfr5d37dqdvuzscvlt3u

Exploit Where Optimizer Explores via Residuals [article]

An Xu, Zhouyuan Huo, Heng Huang
2020 arXiv   pre-print
To exploit the trajectory of (momentum) stochastic gradient descent (SGD(m)) method, we propose a novel method named SGD(m) with residuals (RSGD(m)), which leads to a performance boost of both the convergence  ...  stage, and similar to or better than SGD(m) at the end of training with better generalization error.  ...  To address this problem, we propose a novel RSGD(m) algorithm to improve SGD(m) by exploiting the SGD(m) trajectory using residuals.  ... 
arXiv:2004.05298v2 fatcat:l4k3x5hzqbbm5odhfpwpppcnnm

Logit Attenuating Weight Normalization [article]

Aman Gupta, Rohan Ramanath, Jun Shi, Anika Ramachandran, Sirou Zhou, Mingzhou Zhou, S. Sathiya Keerthi
2021 arXiv   pre-print
While LAWN is particularly impressive in improving Adam, it greatly improves all optimizers when used with large batch sizes  ...  Although regularization is typically understood from an overfitting perspective, we highlight its role in making the network more adaptive and enabling it to escape more easily from weights that generalize  ...  From a theory perspective, weight norm bounding has been shown to be useful for improving generalization [3, 29] .  ... 
arXiv:2108.05839v1 fatcat:jms43a72pzao5hz5ycaclecuza

Reinforced stochastic gradient descent for deep neural network learning [article]

Haiping Huang, Taro Toyoizumi
2017 arXiv   pre-print
For a benchmark handwritten digits dataset, the learning performance is comparable to Adam, yet with an extra advantage of requiring one-fold less computer memory.  ...  Therefore, it is highly desirable to design an efficient algorithm to escape from these saddle points and reach a parameter region of better generalization capabilities.  ...  This work was supported by the program for Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS) from Japan Agency for Medical Research and development, AMED, and by RIKEN Brain  ... 
arXiv:1701.07974v5 fatcat:lja5itgbcjfh5p7q53hsrnoonm

A Bounded Scheduling Method for Adaptive Gradient Methods

Mingxing Tang, Zhen Huang, Yuan Yuan, Changjian Wang, Yuxing Peng
2019 Applied Sciences  
To overcome this, we propose a bounded scheduling algorithm for Adam, which can not only improve the generalization capability but also ensure convergence.  ...  Experimental results show that our method can eliminate the generalization gap between Adam and SGD while maintaining a relatively high convergence rate during training.  ...  The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.  ... 
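One way to bound an adaptive method, sketched below, is to clip Adam's per-parameter effective learning rate into a fixed interval (an AdaBound-style bound; the paper's actual schedule and bound values may differ, and `lo`/`hi` here are illustrative constants):

```python
import math

def bounded_adam_step(x, g, state, base_lr=0.1, lo=0.01, hi=0.3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step whose effective learning rate is clipped into [lo, hi]."""
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    eff_lr = base_lr / (math.sqrt(v_hat) + eps)  # Adam's adaptive step size
    eff_lr = min(max(eff_lr, lo), hi)            # bound it to curb extremes
    return x - eff_lr * m_hat, eff_lr

# Toy run: minimize f(x) = x^2, recording the bounded step sizes.
state = {"t": 0, "m": 0.0, "v": 0.0}
x, lrs = 3.0, []
for _ in range(300):
    x, lr = bounded_adam_step(x, 2.0 * x, state)  # gradient of x^2 is 2x
    lrs.append(lr)
```

Clipping keeps the update inside an SGD-like range even when the second-moment estimate is very small or very large, which is the mechanism such scheduling methods use to close the Adam-SGD generalization gap.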
doi:10.3390/app9173569 fatcat:47vxmawpojbnbe7qdqjwcblkvi

Short-Term Prediction of Bus Passenger Flow Based on a Hybrid Optimized LSTM Network

Yong Han, Cheng Wang, Yibin Ren, Shukang Wang, Huangcheng Zheng, Ge Chen
2019 ISPRS International Journal of Geo-Information  
We have also tried combinations of other optimization algorithms and applications in different models, finding that optimizing LSTM by switching from Nadam to SGD is the best choice.  ...  In particular, the proposed model brings about a 4%–20% extra performance improvement compared with that of non-hybrid LSTM models.  ...  Acknowledgments: Thanks for the data provided by Qingdao Public Transportation Group. Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/ijgi8090366 fatcat:gzlswd3dqfgvhmwoneg2p3b32y

An empirical analysis of the optimization of deep network loss surfaces [article]

Daniel Jiwoong Im, Michael Tao, Kristin Branson
2017 arXiv   pre-print
To do this, we visualize the loss function by projecting it down to low-dimensional spaces chosen based on the convergence points of different optimization algorithms.  ...  The success of deep neural networks hinges on our ability to accurately and efficiently optimize high-dimensional, non-convex functions.  ...  NIN: Switching from SGD (S, η = .1) to Adam (A, η = .0001). VGG: Switching from Adam (A, η = .001) to Adadelta (ADE). VGG: Switching from Adadelta (ADE) to Adam (A, η = .001).  ... 
arXiv:1612.04010v4 fatcat:zi3rg22bxncvjkhpt23usukjoq

An Image Classification Method Based on Deep Neural Network with Energy Model

Yang Yang, Jinbao Duan, Haitao Yu, Zhipeng Gao, Xuesong Qiu
2018 CMES - Computer Modeling in Engineering & Sciences  
optimization algorithms such as SGD, Adam and so on.  ...  The novel contributions in this paper are as follows: (1) We improved the stochastic depth network by using a deep energy model, which can obtain better generative models when training based on the energy  ...  Eq. (3.16) shows the conditions for switching from Adam to SGD: a noisy estimate of the scaling needed, and its exponential average.  ... 
doi:10.31614/cmes.2018.04249 fatcat:vytorzfmabhktgocq2mhy7wto4

A Simple Guard for Learned Optimizers [article]

Isabeau Prémont-Schwarz, Jaroslav Vítků, Jan Feyereisl
2022 arXiv   pre-print
(Heaton et al., 2020) proposed Safeguarded L2O (GL2O), which can take a learned optimizer and safeguard it with a generic learning algorithm so that, by conditionally switching between the two, the resulting algorithm  ...  If the trend of learned components eventually outperforming their hand-crafted versions continues, learned optimizers will eventually outperform hand-crafted optimizers like SGD or Adam.  ...  Acknowledgements We would like to thank Martin Poliak, Nicholas Guttenberg, and David Castillo for their very helpful comments and discussions.  ... 
arXiv:2201.12426v1 fatcat:tytjnxw24neixazl6dgjeqbm4a

An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

Tomoya ITSUBO, Michihiro KOIBUCHI, Hideharu AMANO, Hiroki MATSUTANI
2021 IEICE transactions on information and systems  
Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs  ...  Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.  ...  To improve SGD in terms of accuracy and the number of iterations to convergence, variants of SGD have been invented.  ... 
doi:10.1587/transinf.2021pap0008 fatcat:jt65rtgxmfd4nplooirlfghqum

How Do Adam and Training Strategies Help BNNs Optimization? [article]

Zechun Liu, Zhiqiang Shen, Shichao Li, Koen Helwegen, Dong Huang, Kwang-Ting Cheng
2021 arXiv   pre-print
However, to the best of our knowledge, few studies explore the fundamental reasons why Adam is superior to other optimizers like SGD for BNN optimization or provide analytical explanations that support  ...  We find that Adam, through its adaptive learning rate strategy, is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability.  ...  As a result, "dead" weights from saturation are easier to be re-activated by Adam than SGD.  ... 
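The snippet's claim, that Adam's adaptive scaling re-activates near-dead weights more readily than SGD, can be illustrated with a toy calculation (the optimizer constants here are the common Adam defaults, not values taken from the paper):

```python
import math

def first_adam_step(g, lr=1e-3, eps=1e-8):
    """Magnitude of the first bias-corrected Adam update from a cold start:
    m_hat = g and v_hat = g*g at t = 1, so the step is ~lr for any nonzero g."""
    m_hat = g        # m = (1 - b1) * g, divided by the correction (1 - b1)
    v_hat = g * g    # v = (1 - b2) * g*g, divided by the correction (1 - b2)
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))

tiny, large = 1e-6, 1.0
sgd_tiny = 0.1 * tiny                # an SGD step is proportional to the gradient
adam_tiny = first_adam_step(tiny)    # ~lr despite the near-zero gradient
adam_large = first_adam_step(large)  # ~lr as well: scale-invariant step
```

Because Adam divides by the root of the squared-gradient average, a saturated weight with a vanishing gradient still receives a step of roughly `lr`, whereas SGD's step vanishes with the gradient, which matches the "dead weights are easier to re-activate" observation.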
arXiv:2106.11309v1 fatcat:5m3ezp4gxvdsbdg2ptv2udpzmq

Faster Biological Gradient Descent Learning [article]

Ho Ling Li
2020 arXiv   pre-print
A number of algorithms have been developed to speed up convergence and improve robustness of the learning.  ...  However, they are complicated to implement biologically as they require information from previous updates.  ...  We would also like to thank University of Nottingham High Performance Computing for providing computing powers for this research.  ... 
arXiv:2009.12745v1 fatcat:hwohq2pl2zab3iecsl62ilkfdm

AdaSGD: Bridging the gap between SGD and Adam [article]

Jiaxuan Wang, Jenna Wiens
2020 arXiv   pre-print
In the context of stochastic gradient descent (SGD) and adaptive moment estimation (Adam), researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving  ...  In this work, by first studying the convex setting, we identify potential contributors to observed differences in performance between SGD and Adam.  ...  Figure 3: An illustration of why SGD may lead to better generalization performance compared to Adam.  ... 
arXiv:2006.16541v1 fatcat:hd2jkgswlzbiffvd4t5xvf7fqy

Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English [article]

BalaSundaraRaman Lakshmanan, Sanjeeth Kumar Ravindranath
2020 arXiv   pre-print
This performance betters the top ranked classifier on this dataset by a wide margin.  ...  Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a novel application of Soundex.  ...  Theedhum Nandrum also benefited from open source contributions of several people.  ... 
arXiv:2010.03189v2 fatcat:6siktgnrqzhkhdsgirw5prz3ly
Showing results 1 — 15 out of 2,191 results