Improving Generalization Performance by Switching from Adam to SGD
[article]
2017
arXiv
pre-print
Concretely, we propose SWATS, a simple strategy which switches from Adam to SGD when a triggering condition is satisfied. ...
These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. ...
To investigate this further, we propose SWATS, a simple strategy that combines the best of both worlds by Switching from Adam to SGD. ...
arXiv:1712.07628v1
fatcat:uksgec7lfnfbjpcjbuz5e2l3mu
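The switching mechanism described in the abstract above, Adam early in training and SGD later once a trigger fires, can be sketched in a few lines of PyTorch. The fixed switch_epoch trigger and the learning rates below are illustrative assumptions; SWATS derives its switching point and SGD learning rate from the Adam updates themselves.

```python
# Minimal sketch of an Adam-to-SGD switch in PyTorch.
# The trigger (a fixed epoch) and the learning rates are illustrative
# assumptions, not the projection-based criterion or values from SWATS.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
switch_epoch = 10          # assumed trigger; SWATS computes this during training
switched = False

for epoch in range(30):
    if not switched and epoch >= switch_epoch:
        # Hand the same parameters over to SGD once the trigger fires.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
        switched = True
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```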
An Optimization Strategy Based on Hybrid Algorithm of Adam and SGD
2018
MATEC Web of Conferences
(..., 2017) proposed a hybrid strategy to start training with Adam and switch to SGD at the right time. ...
Abstract: Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). ...
Acknowledgment Note: this paper is supported by the attachment project, project name such as attachment. ...
doi:10.1051/matecconf/201823203007
fatcat:7xo3q7dfr5d37dqdvuzscvlt3u
Exploit Where Optimizer Explores via Residuals
[article]
2020
arXiv
pre-print
To exploit the trajectory of (momentum) stochastic gradient descent (SGD(m)) method, we propose a novel method named SGD(m) with residuals (RSGD(m)), which leads to a performance boost of both the convergence ...
stage, and similar to or better than SGD(m) at the end of training with better generalization error. ...
To address this problem, we propose a novel RSGD(m) algorithm to improve SGD(m) by exploiting the SGD(m) trajectory using residuals. ...
arXiv:2004.05298v2
fatcat:l4k3x5hzqbbm5odhfpwpppcnnm
Logit Attenuating Weight Normalization
[article]
2021
arXiv
pre-print
While LAWN is particularly impressive in improving Adam, it greatly improves all optimizers when used with large batch sizes ...
Although regularization is typically understood from an overfitting perspective, we highlight its role in making the network more adaptive and enabling it to escape more easily from weights that generalize ...
From a theory perspective, weight norm bounding has been shown to be useful for improving generalization [3, 29] . ...
arXiv:2108.05839v1
fatcat:jms43a72pzao5hz5ycaclecuza
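Weight-norm bounding of the kind cited in this abstract can be illustrated generically as a projection applied after each optimizer step. The per-tensor granularity and the radius max_norm are assumptions for illustration, not the LAWN procedure itself.

```python
# Generic weight-norm bounding: after each optimizer step, project every
# parameter tensor back onto a norm ball of radius max_norm.
# The radius and the per-tensor granularity are illustrative assumptions,
# not the LAWN scheme from the paper.
import torch

@torch.no_grad()
def bound_weight_norms(model: torch.nn.Module, max_norm: float = 1.0) -> None:
    for p in model.parameters():
        norm = p.norm()
        if norm > max_norm:
            p.mul_(max_norm / norm)
```

Calling a projection like this immediately after optimizer.step() keeps every parameter tensor inside the chosen norm ball throughout training.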
Reinforced stochastic gradient descent for deep neural network learning
[article]
2017
arXiv
pre-print
For a benchmark handwritten digits dataset, the learning performance is comparable to Adam, yet with an extra advantage of requiring one-fold less computer memory. ...
Therefore, it is highly desirable to design an efficient algorithm to escape from these saddle points and reach a parameter region of better generalization capabilities. ...
This work was supported by the program for Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS) from the Japan Agency for Medical Research and Development (AMED), and by RIKEN Brain ...
arXiv:1701.07974v5
fatcat:lja5itgbcjfh5p7q53hsrnoonm
A Bounded Scheduling Method for Adaptive Gradient Methods
2019
Applied Sciences
To overcome this, we propose a bounded scheduling algorithm for Adam, which not only improves the generalization capability but also ensures convergence. ...
Experimental results show that our method can eliminate the generalization gap between Adam and SGD while maintaining a relatively high convergence rate during training. ...
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results. ...
doi:10.3390/app9173569
fatcat:47vxmawpojbnbe7qdqjwcblkvi
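The idea of bounding Adam's adaptive step size can be sketched in plain NumPy by clipping the per-parameter effective learning rate into a fixed interval. The bounds lower and upper, and the fact that they are constant here, are simplifying assumptions, not the scheduling rule proposed in the paper.

```python
# Illustrative Adam step with the per-parameter effective learning rate
# clipped into [lower, upper]. The fixed bounds are an assumption for
# illustration; the paper schedules its bounds over training.
import numpy as np

def bounded_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                      lower=1e-4, upper=1e-1):
    m = b1 * m + (1 - b1) * g                     # first-moment estimate
    v = b2 * v + (1 - b2) * g * g                 # second-moment estimate
    m_hat = m / (1 - b1 ** t)                     # bias correction
    v_hat = v / (1 - b2 ** t)
    step_size = lr / (np.sqrt(v_hat) + eps)       # Adam's effective step size
    step_size = np.clip(step_size, lower, upper)  # bound it
    return w - step_size * m_hat, m, v
```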
Short-Term Prediction of Bus Passenger Flow Based on a Hybrid Optimized LSTM Network
2019
ISPRS International Journal of Geo-Information
We have also tried combinations of other optimization algorithms and applications in different models, finding that optimizing the LSTM by switching from Nadam to SGD is the best choice. ...
In particular, the proposed model brings a 4%–20% extra performance improvement compared with non-hybrid LSTM models. ...
Acknowledgments: Thanks to Qingdao Public Transportation Group for providing the data.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/ijgi8090366
fatcat:gzlswd3dqfgvhmwoneg2p3b32y
An empirical analysis of the optimization of deep network loss surfaces
[article]
2017
arXiv
pre-print
To do this, we visualize the loss function by projecting it down to low-dimensional spaces chosen based on the convergence points of different optimization algorithms. ...
The success of deep neural networks hinges on our ability to accurately and efficiently optimize high-dimensional, non-convex functions. ...
Figure captions: NIN: switching from SGD (S, η = 0.1) to Adam (A, η = 0.0001); VGG: switching from Adam (A, η = 0.001) to Adadelta (ADE); VGG: switching from Adadelta (ADE) to Adam (A, η = 0.001). ...
arXiv:1612.04010v4
fatcat:zi3rg22bxncvjkhpt23usukjoq
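Low-dimensional loss projections of this kind are commonly built by evaluating the loss along straight lines between the converged weight vectors of two optimizers. The sketch below shows that generic recipe with a toy quadratic loss; the real setting would use the network's loss and the actual converged weights.

```python
# Generic 1-D loss slice between two converged solutions w_a and w_b:
# evaluate the loss at convex combinations (1 - t) * w_a + t * w_b.
# The toy loss and weight vectors below are stand-ins for illustration.
import numpy as np

def loss_along_segment(loss_fn, w_a, w_b, num_points=50):
    ts = np.linspace(0.0, 1.0, num_points)
    return ts, [loss_fn((1 - t) * w_a + t * w_b) for t in ts]

toy_loss = lambda w: float(np.sum(w ** 2))
w_adam, w_sgd = np.ones(5), np.full(5, -0.5)   # pretend converged solutions
ts, losses = loss_along_segment(toy_loss, w_adam, w_sgd)
```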
An Image Classification Method Based on Deep Neural Network with Energy Model
2018
CMES - Computer Modeling in Engineering & Sciences
optimization algorithms such as SGD, Adam and so on. ...
The novel contributions in this paper are as follows: (1) We improved the stochastic depth network by using a deep energy model, which can obtain better generative models when training based on the energy ...
Eq. (3.16) gives the condition for switching from Adam to SGD: a noisy estimate of the required scaling is compared against its exponential average. ...
doi:10.31614/cmes.2018.04249
fatcat:vytorzfmabhktgocq2mhy7wto4
A Simple Guard for Learned Optimizers
[article]
2022
arXiv
pre-print
(..., 2020) proposed Safeguarded L2O (GL2O), which can take a learned optimizer and safeguard it with a generic learning algorithm so that, by conditionally switching between the two, the resulting algorithm ...
If the trend of learned components eventually outperforming their hand-crafted version continues, learned optimizers will eventually outperform hand-crafted optimizers like SGD or Adam. ...
Acknowledgements We would like to thank Martin Poliak, Nicholas Guttenberg, and David Castillo for their very helpful comments and discussions. ...
arXiv:2201.12426v1
fatcat:tytjnxw24neixazl6dgjeqbm4a
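The conditional switching described for Safeguarded L2O can be sketched as a per-step guard that accepts the learned optimizer's proposal only when it does not increase the loss, and otherwise falls back to a plain SGD step. The acceptance test here is a simplifying assumption, not the safeguard criterion of GL2O, and learned_update is a hypothetical stand-in for a learned optimizer.

```python
# Sketch of a guarded update: try the learned optimizer's proposed step,
# fall back to plain SGD if the proposal does not decrease the loss.
# The acceptance test is a simplifying assumption, not the GL2O criterion.
import numpy as np

def guarded_step(w, grad_fn, loss_fn, learned_update, lr=0.1):
    g = grad_fn(w)
    proposal = w + learned_update(w, g)   # step suggested by the learned optimizer
    if loss_fn(proposal) <= loss_fn(w):
        return proposal                   # accept the learned step
    return w - lr * g                     # otherwise take a plain SGD step

# Toy usage with a quadratic loss; the "learned" update is a made-up heuristic.
loss_fn = lambda w: float(np.sum(w ** 2))
grad_fn = lambda w: 2 * w
learned_update = lambda w, g: -0.5 * g    # hypothetical learned optimizer
w = guarded_step(np.ones(3), grad_fn, loss_fn, learned_update)
```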
An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs
2021
IEICE transactions on information and systems
Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs ...
Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate. ...
To improve SGD in terms of accuracy and the number of iterations to convergence, variants of SGD have been invented. ...
doi:10.1587/transinf.2021pap0008
fatcat:jt65rtgxmfd4nplooirlfghqum
How Do Adam and Training Strategies Help BNNs Optimization?
[article]
2021
arXiv
pre-print
However, to the best of our knowledge, few studies explore the fundamental reasons why Adam is superior to other optimizers like SGD for BNN optimization or provide analytical explanations that support ...
We find that Adam, through its adaptive learning rate strategy, is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability. ...
As a result, "dead" weights from saturation are easier to be re-activated by Adam than SGD. ...
arXiv:2106.11309v1
fatcat:5m3ezp4gxvdsbdg2ptv2udpzmq
Faster Biological Gradient Descent Learning
[article]
2020
arXiv
pre-print
A number of algorithms have been developed to speed up convergence and improve robustness of the learning. ...
However, they are complicated to implement biologically as they require information from previous updates. ...
We would also like to thank the University of Nottingham High Performance Computing service for providing computing power for this research. ...
arXiv:2009.12745v1
fatcat:hwohq2pl2zab3iecsl62ilkfdm
AdaSGD: Bridging the gap between SGD and Adam
[article]
2020
arXiv
pre-print
In the context of stochastic gradient descent (SGD) and adaptive moment estimation (Adam), researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving ...
In this work, by first studying the convex setting, we identify potential contributors to observed differences in performance between SGD and Adam. ...
Figure 3: An illustration of why SGD may lead to better generalization performance compared to Adam. ...
arXiv:2006.16541v1
fatcat:hd2jkgswlzbiffvd4t5xvf7fqy
Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English
[article]
2020
arXiv
pre-print
This performance betters the top ranked classifier on this dataset by a wide margin. ...
Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a novel application of Soundex. ...
Theedhum Nandrum also benefited from open source contributions of several people. ...
arXiv:2010.03189v2
fatcat:6siktgnrqzhkhdsgirw5prz3ly
Showing results 1 — 15 out of 2,191 results