On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
[article] · 2017 · arXiv pre-print
training and testing functions - and as is well known, sharp minima lead to poorer generalization. ...
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. ...
We conclude with open questions concerning the generalization gap, sharp minima, and possible modifications to make large-batch training viable. ...
arXiv:1609.04836v2
fatcat:gniwsh3bz5bhxgq7kxke27nwyi
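The sharpness referred to in this abstract is often probed empirically by checking how much the training loss rises under small random perturbations of the trained weights. The sketch below is a generic proxy of that kind, not the exact metric from the paper; model_loss, eps, and n_samples are assumed placeholders.

import numpy as np

def sharpness_proxy(weights, model_loss, eps=1e-2, n_samples=20, seed=0):
    # Average increase of the training loss under random weight
    # perturbations of radius eps (illustrative sharpness proxy only).
    rng = np.random.default_rng(seed)
    base = model_loss(weights)
    increases = []
    for _ in range(n_samples):
        direction = rng.normal(size=weights.shape)
        direction *= eps / (np.linalg.norm(direction) + 1e-12)
        increases.append(model_loss(weights + direction) - base)
    return float(np.mean(increases))

# A flat minimum yields a small value, a sharp minimum a large one:
# sharpness_proxy(w_small_batch, loss_fn) vs. sharpness_proxy(w_large_batch, loss_fn)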
SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning
[article] · 2018 · arXiv pre-print
In a variety of experiments, SmoothOut and AdaSmoothOut consistently improve generalization in both small-batch and large-batch training on the top of state-of-the-art solutions. ...
In Deep Learning, Stochastic Gradient Descent (SGD) is usually selected as a training method because of its efficiency; however, a problem in SGD has recently gained research interest: sharp minima in Deep ...
Our approach is based on the second hypothesis, aiming to escape sharp minima for better generalization in both small-batch and large-batch SGD. ...
arXiv:1805.07898v3
fatcat:2un55cf2yjcddksdauogyvfiqm
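The abstract describes smoothing out sharp minima via weight perturbation. Below is a minimal sketch of that general idea, assuming a flat NumPy parameter vector and a user-supplied grad_fn; the paper's actual noise distribution, de-noising step, and adaptive variant (AdaSmoothOut) are not reproduced here.

import numpy as np

def smoothout_style_step(w, grad_fn, lr=0.1, noise_radius=0.01, rng=None):
    # Perturb the weights with uniform noise, take the gradient at the
    # perturbed point, and update the clean (unperturbed) weights.
    # Averaged over steps, this approximates descent on a smoothed loss.
    rng = rng or np.random.default_rng()
    noise = rng.uniform(-noise_radius, noise_radius, size=w.shape)
    g = grad_fn(w + noise)
    return w - lr * g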
Large-Scale Deep Learning Optimizations: A Comprehensive Survey
[article] · 2021 · arXiv pre-print
We investigate the algorithms most commonly used for optimization, elaborate on the debated topic of the generalization gap that arises in large-batch training, and review the SOTA strategies in addressing the ...
However, this generally comes at the cost of longer training time spent on more computation and communication. ...
Generalization Gap and Sharp Minima. With regard to large-batch training, Keskar et al. ...
arXiv:2111.00856v2
fatcat:njjaygney5bqpo6ekyntgcjaiq
A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima
[article] · 2021 · arXiv pre-print
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima in terms of the ratio of the batch size and learning rate. ...
Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. ...
Acknowledgement I would like to express my deep gratitude to Professor Masashi Sugiyama and Professor Issei Sato for their patient guidance and useful critiques of this research work. ...
arXiv:2002.03495v14
fatcat:tbavmri37jciziarjz2ybnnt5m
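The abstract states that escaping a minimum takes exponentially many iterations in the ratio of batch size to learning rate. The toy function below only illustrates that scaling; the constant c is purely hypothetical and not taken from the paper.

import math

def escape_iterations(batch_size, lr, c=1e-3):
    # Toy illustration of the reported scaling: escape time grows
    # exponentially in batch_size / lr. The constant c is hypothetical.
    return math.exp(c * batch_size / lr)

# Doubling the batch size (or halving the learning rate) squares the
# escape time in this toy model:
# escape_iterations(512, 0.1) == escape_iterations(256, 0.1) ** 2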
Spectral Norm Regularization for Improving the Generalizability of Deep Learning
[article] · 2017 · arXiv pre-print
We investigate the generalizability of deep learning based on the sensitivity to input perturbation. ...
We provide supportive evidence for the abovementioned hypothesis by experimentally confirming that the models trained using spectral norm regularization exhibit better generalizability than other baseline ...
small-batch regime and 0.1 in the large-batch regime for the VGGNet and DenseNet models on the CIFAR-10 dataset. ...
arXiv:1705.10941v1
fatcat:qwov5slp35hklmq25od72bj5qy
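Spectral norm regularization penalizes the largest singular value of each layer's weight matrix, which is typically estimated by power iteration. The sketch below illustrates that computation; the penalty coefficient lam and the two-function split are illustrative choices rather than the authors' code.

import numpy as np

def spectral_norm(W, n_iters=10, rng=None):
    # Estimate the largest singular value of W by power iteration.
    rng = rng or np.random.default_rng(0)
    v = rng.normal(size=W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
    return float(u @ W @ v)

def regularized_loss(data_loss, weight_matrices, lam=0.01):
    # Task loss plus lam/2 times the sum of squared spectral norms.
    return data_loss + 0.5 * lam * sum(spectral_norm(W) ** 2 for W in weight_matrices)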
Generalization Error in Deep Learning
[article] · 2019 · arXiv pre-print
In this article, we provide an overview of the existing theory and bounds for the characterization of the generalization error of deep neural networks, combining both classical and more recent theoretical ...
Thus, an important question is what makes deep neural networks able to generalize well from the training set to new data. ...
On large-batch training for deep learning: generalization gap and sharp minima. In [14], another point of view on stochastic gradient methods is taken through the examination of the effect of the size ...
arXiv:1808.01174v3
fatcat:yjem7ahdhbfg5glo2liadysrje
Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory
[article] · 2020 · arXiv pre-print
In this work, we: (1) prove the widespread existence of suboptimal local minima in the loss landscape of neural networks, and we use our theory to find examples; (2) show that small-norm parameters are not optimal for generalization; (3) demonstrate that ResNets do not conform to wide-network theories, such as the neural tangent kernel, and that the interaction between skip connections and batch normalization ...
ACKNOWLEDGMENTS This work was supported by the AFOSR MURI Program, the National Science Foundation DMS directorate, and also the DARPA YFA and L2M programs. ...
arXiv:1910.00359v3
fatcat:oas2iunoyfantiepiklcz5pude
The large learning rate phase of deep learning: the catapult mechanism
[article] · 2020 · arXiv pre-print
One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. ...
In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice. ...
Acknowledgements The authors would like to thank Kyle Aitken, Dar Gilboa, Justin Gilmer, Boris Hanin, Tengyu Ma, Andrea Montanari, and Behnam Neyshabur for useful discussions. ...
arXiv:2003.02218v1
fatcat:t5brjhbb3ffwxlutrqlhwliypu
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
[article] · 2018 · arXiv pre-print
One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate. ...
The existence of super-convergence is relevant to understanding why deep networks generalize well. ...
On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Diederik Kingma and Jimmy Ba. ...
arXiv:1708.07120v3
fatcat:ff5p24boonbzjm7qf4pt4vabqq
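Super-convergence relies on a single learning rate cycle with a large maximum learning rate. The function below sketches a generic triangular one-cycle schedule; the base_lr and final_lr defaults are assumptions, not the paper's exact settings.

def one_cycle_lr(step, total_steps, max_lr, base_lr=None, final_lr=None):
    # Triangular one-cycle schedule: ramp linearly from base_lr to max_lr
    # over the first half of training, then back down to final_lr.
    base_lr = base_lr if base_lr is not None else max_lr / 10.0
    final_lr = final_lr if final_lr is not None else base_lr / 100.0
    half = max(total_steps // 2, 1)
    if step < half:
        return base_lr + (step / half) * (max_lr - base_lr)
    frac = (step - half) / max(total_steps - half, 1)
    return max_lr + frac * (final_lr - max_lr)

# Example: one_cycle_lr(step=0, total_steps=1000, max_lr=1.0) -> 0.1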
Super-convergence: very fast training of neural networks using large learning rates
2019 · Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications
One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate. ...
The existence of super-convergence is relevant to understanding why deep networks generalize well. ...
Keskar et al. (2016) study the generalization gap between small and large mini-batches, stating that small mini-batch sizes lead to wide, flat minima and large batch sizes lead to sharp minima. ...
doi:10.1117/12.2520589
fatcat:jvkiuhrajrf2plx2sabrnu4zee
LRTuner: A Learning Rate Tuner for Deep Neural Networks
[article] · 2021 · arXiv pre-print
One very important hyperparameter for training deep neural networks is the learning rate schedule of the optimizer. ...
The kind of minima attained has a significant impact on the generalization accuracy of the network. ...
On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Alex Krizhevsky, Geoffrey Hinton, et al. ...
arXiv:2105.14526v1
fatcat:nrr7i265kjhgjjunvfsrm6lvgq
Improving Generalization in Federated Learning by Seeking Flat Minima
[article] · 2022 · arXiv pre-print
Motivated by prior studies connecting the sharpness of the loss surface and the generalization gap, we show that i) training clients locally with Sharpness-Aware Minimization (SAM) or its adaptive version (ASAM) and ii) averaging stochastic weights (SWA) on the server side can substantially improve generalization in Federated Learning and help bridge the gap with centralized models. ...
Acknowledgments We thank Lidia Fantauzzo for her valuable help and support in running the semantic segmentation experiments. ...
arXiv:2203.11834v2
fatcat:tsj7gdaanrdspfo77qs655tjv4
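Sharpness-Aware Minimization, mentioned in this abstract, takes the gradient at an adversarially perturbed point within a small L2 ball and applies it to the original weights. The sketch below shows that basic update for a NumPy parameter vector; it is not the federated or adaptive (ASAM) variant studied in the paper.

import numpy as np

def sam_step(w, grad_fn, lr=0.01, rho=0.05):
    # Ascend to the (approximately) worst-case point within an L2 ball of
    # radius rho, then apply the gradient taken there to the original weights.
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(w + eps)
    return w - lr * g_sharp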
Not all noise is accounted equally: How differentially private learning benefits from large sampling rates
[article] · 2021 · arXiv pre-print
Learning often involves sensitive data and, as such, privacy-preserving extensions to Stochastic Gradient Descent (SGD) and other machine learning algorithms have been developed using the definitions of ...
Given this observation, we propose a training paradigm that shifts the proportions of noise towards less inherent and more additive noise, such that more of the overall noise can be accounted for in the ...
Concretely, it has been shown to act as regularization and enable the algorithm to "escape" from sharp, bad-generalizing minima and descend to "flat", well-generalizing minima, meaning that it serves as ...
arXiv:2110.06255v1
fatcat:yl6vfurnrngfvh2d45o632hbjy
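The privacy-preserving SGD extensions mentioned here typically clip per-example gradients and add calibrated Gaussian noise. The sketch below shows that generic DP-SGD update; clip_norm and noise_multiplier are illustrative parameters, and the paper's specific treatment of sampling rates is not modeled.

import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    # Clip each per-example gradient to clip_norm, average, and add
    # Gaussian noise scaled by noise_multiplier * clip_norm / batch size.
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm / len(clipped),
                       size=w.shape)
    return w - lr * (mean_grad + noise)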
Deep Bilevel Learning
[chapter] · 2018 · Lecture Notes in Computer Science
Overfitting is controlled by introducing weights on each mini-batch in the training set and by choosing their values so that they minimize the error on the validation set. ...
In practice, these weights define mini-batch learning rates in a gradient descent update equation that favor gradients with better generalization capabilities. ...
doi:10.1007/978-3-030-01249-6_38
fatcat:hy6tdc6cvzhqpih5qt7x3jqnra
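One simple way to instantiate the idea of mini-batch weights that act as per-batch learning rates is to weight each mini-batch gradient by its alignment with a validation gradient. The sketch below does exactly that under the assumption of flattened parameter vectors; it is an illustration of the idea, not the bilevel procedure from the paper.

import numpy as np

def weighted_batch_step(w, batch_grads, val_grad, lr=0.1):
    # Weight each mini-batch gradient by its (non-negative) alignment with
    # the validation gradient, so batches whose gradients generalize better
    # receive larger effective learning rates.
    weights = np.array([max(0.0, float(g @ val_grad)) for g in batch_grads])
    if weights.sum() > 0:
        weights = weights / weights.sum()
    update = sum(wt * g for wt, g in zip(weights, batch_grads))
    return w - lr * update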
Extrapolation for Large-batch Training in Deep Learning
[article] · 2020 · arXiv pre-print
A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). ...
Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training ...
., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Kleinberg, R., Li, Y., and Yuan, Y. ...
arXiv:2006.05720v1
fatcat:yz2d4stqrjgtbokc5eyjm6fati
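The extrapolation idea named in the title resembles an extragradient step: evaluate the gradient at a lookahead point and apply it to the original iterate. The sketch below shows such a step for a NumPy parameter vector; it is a generic extragradient-style update, not necessarily the authors' exact algorithm.

import numpy as np

def extragradient_sgd_step(w, grad_fn, lr=0.1, extrapolation_lr=0.1):
    # Take a lookahead (extrapolation) step from w, evaluate the gradient
    # at the lookahead point, and apply that gradient to the original iterate.
    w_lookahead = w - extrapolation_lr * grad_fn(w)
    return w - lr * grad_fn(w_lookahead)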
Showing results 1 — 15 out of 1,050 results