1,050 Hits in 6.2 sec

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima [article]

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang
2017 arXiv   pre-print
training and testing functions - and as is well known, sharp minima lead to poorer generalization.  ...  The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks.  ...  We conclude with open questions concerning the generalization gap, sharp minima, and possible modifications to make large-batch training viable.  ... 
arXiv:1609.04836v2 fatcat:gniwsh3bz5bhxgq7kxke27nwyi
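
As a rough illustration of the flat-versus-sharp distinction this entry discusses, the sketch below measures a simple sharpness proxy: the average loss increase under small random parameter perturbations. This is a simplification for intuition only, not Keskar et al.'s exact epsilon-sharpness metric; the toy losses and perturbation radius are made up.

```python
# Minimal sketch (not Keskar et al.'s exact metric): estimate "sharpness" of a
# minimum as the average loss increase under small random parameter perturbations.
import numpy as np

def sharpness(loss_fn, theta, radius=1e-2, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    base = loss_fn(theta)
    increases = []
    for _ in range(n_samples):
        delta = rng.normal(size=theta.shape)
        delta *= radius / np.linalg.norm(delta)      # perturbation on a small sphere
        increases.append(loss_fn(theta + delta) - base)
    return float(np.mean(increases))

# Toy 1-D losses with a flat minimum at x = -2 and a sharp one at x = +2.
flat  = lambda th: 0.5 * (th[0] + 2.0) ** 2          # low curvature
sharp = lambda th: 50.0 * (th[0] - 2.0) ** 2         # high curvature

print(sharpness(flat,  np.array([-2.0])))   # small loss increase -> "flat"
print(sharpness(sharp, np.array([ 2.0])))   # much larger increase -> "sharp"
```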

SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning [article]

Wei Wen, Yandan Wang, Feng Yan, Cong Xu, Chunpeng Wu, Yiran Chen, Hai Li
2018 arXiv   pre-print
In a variety of experiments, SmoothOut and AdaSmoothOut consistently improve generalization in both small-batch and large-batch training on top of state-of-the-art solutions.  ...  In Deep Learning, Stochastic Gradient Descent (SGD) is usually selected as a training method because of its efficiency; recently, however, a problem with SGD has gained research interest: sharp minima in Deep  ...  Our approach is based on the second hypothesis, aiming to escape sharp minima for better generalization in both small-batch and large-batch SGD.  ... 
arXiv:1805.07898v3 fatcat:2un55cf2yjcddksdauogyvfiqm
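
The snippet frames SmoothOut as perturbing weights to smooth away sharp minima. Below is a heavily simplified sketch of that general idea, taking gradient steps on an averaged loss over uniform weight perturbations; the sampling scheme, radius, and toy loss are assumptions and not the paper's actual algorithm.

```python
import numpy as np

def smoothed_grad(grad_fn, w, radius=0.05, n_samples=8, seed=0):
    # Monte Carlo gradient of the smoothed loss E_eps[L(w + eps)]: average
    # gradients evaluated at uniformly perturbed copies of the weights.
    rng = np.random.default_rng(seed)
    grads = [grad_fn(w + rng.uniform(-radius, radius, size=w.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)

# Toy usage: L(w) = 0.5 * ||w||^2, so grad_fn(w) = w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = w - 0.1 * smoothed_grad(lambda v: v, w)
print(w)  # ends up close to the minimum at the origin
```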

Large-Scale Deep Learning Optimizations: A Comprehensive Survey [article]

Xiaoxin He, Fuzhao Xue, Xiaozhe Ren, Yang You
2021 arXiv   pre-print
We investigate the algorithms most commonly used for optimization, elaborate on the debated topic of the generalization gap that arises in large-batch training, and review the SOTA strategies for addressing the  ...  However, this generally comes at the cost of longer training time due to more computation and communication.  ...  Generalization Gap and Sharp Minima With regard to large-batch training, Keskar et al.  ... 
arXiv:2111.00856v2 fatcat:njjaygney5bqpo6ekyntgcjaiq
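
One widely cited heuristic in the large-batch literature that surveys like this cover is the linear learning-rate scaling rule. The numbers below are purely illustrative and are not a claim about this particular survey's recommendations.

```python
# Rule-of-thumb sketch: scale the learning rate linearly with the batch size
# relative to a baseline configuration that was tuned at a smaller batch size.
def linear_scaled_lr(base_lr, base_batch_size, batch_size):
    return base_lr * batch_size / base_batch_size

print(linear_scaled_lr(base_lr=0.1, base_batch_size=256, batch_size=8192))  # -> 3.2
```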

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [article]

Zeke Xie, Issei Sato, Masashi Sugiyama
2021 arXiv   pre-print
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima, in terms of the ratio of batch size to learning rate.  ...  Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well.  ...  Acknowledgement I would like to express my deep gratitude to Professor Masashi Sugiyama and Professor Issei Sato for their patient guidance and useful critiques of this research work.  ... 
arXiv:2002.03495v14 fatcat:tbavmri37jciziarjz2ybnnt5m
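
The escape-time result quoted above is stated in terms of the ratio of batch size to learning rate. A back-of-the-envelope way to see why that ratio matters (not the paper's SDE analysis) is that minibatch gradient noise shrinks like 1/B, as the synthetic check below illustrates.

```python
# Synthetic 1-D check: the variance of a minibatch gradient estimate scales as
# 1/B, so larger batches (or smaller learning rates) mean less exploration noise.
import numpy as np

rng = np.random.default_rng(0)
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)  # fake per-example gradients

for batch_size in (16, 256, 4096):
    usable = (len(per_example_grads) // batch_size) * batch_size
    batch_means = per_example_grads[:usable].reshape(-1, batch_size).mean(axis=1)
    print(batch_size, batch_means.var())   # roughly 4 / batch_size
```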

Spectral Norm Regularization for Improving the Generalizability of Deep Learning [article]

Yuichi Yoshida, Takeru Miyato
2017 arXiv   pre-print
We investigate the generalizability of deep learning based on the sensitivity to input perturbation.  ...  We provide supportive evidence for the abovementioned hypothesis by experimentally confirming that the models trained using spectral norm regularization exhibit better generalizability than other baseline  ...  small-batch regime and 0.1 in the large-batch regime for the VGGNet and DenseNet models on the CIFAR-10 dataset.  ... 
arXiv:1705.10941v1 fatcat:qwov5slp35hklmq25od72bj5qy
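
For context on the method named in the title, a spectral-norm penalty needs the largest singular value of each weight matrix, which is typically estimated by power iteration. The sketch below shows that estimate for a random matrix; the layer setup and any penalty coefficient are assumptions, and this is not the authors' code.

```python
# Minimal sketch: estimate the spectral norm (largest singular value) of a weight
# matrix by power iteration, e.g. to use as a regularizer lambda * sigma(W)**2.
import numpy as np

def spectral_norm(W, n_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)   # approximates sigma_max(W)

W = np.random.default_rng(1).normal(size=(64, 128))
print(spectral_norm(W), np.linalg.svd(W, compute_uv=False)[0])  # should roughly agree
```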

Generalization Error in Deep Learning [article]

Daniel Jakubovitz, Raja Giryes, Miguel R. D. Rodrigues
2019 arXiv   pre-print
In this article, we provide an overview of the existing theory and bounds for the characterization of the generalization error of deep neural networks, combining both classical and more recent theoretical  ...  Thus, an important question is what makes deep neural networks able to generalize well from the training set to new data.  ...  On large-batch training for deep learning: generalization gap and sharp minima In [14] another point of view on stochastic gradient methods is taken through the examination of the effect of the size  ... 
arXiv:1808.01174v3 fatcat:yjem7ahdhbfg5glo2liadysrje

Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory [article]

Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein
2020 arXiv   pre-print
not optimal for generalization; (3) demonstrate that ResNets do not conform to wide-network theories, such as the neural tangent kernel, and that the interaction between skip connections and batch normalization  ...  In this work, we: (1) prove the widespread existence of suboptimal local minima in the loss landscape of neural networks, and we use our theory to find examples; (2) show that small-norm parameters are  ...  ACKNOWLEDGMENTS This work was supported by the AFOSR MURI Program, the National Science Foundation DMS directorate, and also the DARPA YFA and L2M programs.  ... 
arXiv:1910.00359v3 fatcat:oas2iunoyfantiepiklcz5pude

The large learning rate phase of deep learning: the catapult mechanism [article]

Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari
2020 arXiv   pre-print
One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.  ...  In particular, they fill a gap between existing wide neural network theory and the nonlinear, large-learning-rate training dynamics relevant to practice.  ...  Acknowledgements The authors would like to thank Kyle Aitken, Dar Gilboa, Justin Gilmer, Boris Hanin, Tengyu Ma, Andrea Montanari, and Behnam Neyshabur for useful discussions.  ... 
arXiv:2003.02218v1 fatcat:t5brjhbb3ffwxlutrqlhwliypu
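
The "narrow range of large, stable learning rates" is easiest to anchor against the textbook quadratic case, where gradient descent on L(x) = 0.5 * lam * x^2 is stable only for lr < 2/lam. The catapult analysis studies what happens beyond that edge in wide networks; the toy sketch below only shows the textbook threshold, not the catapult effect itself.

```python
# Textbook stability check on a 1-D quadratic loss L(x) = 0.5 * lam * x**2:
# gradient descent converges iff lr < 2 / lam.
lam = 10.0
for lr in (0.05, 0.19, 0.21):          # 2 / lam = 0.2 is the stability edge
    x = 1.0
    for _ in range(50):
        x -= lr * lam * x              # gradient step on the quadratic
    print(lr, x)                       # shrinks toward 0 below 0.2, grows above it
```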

Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates [article]

Leslie N. Smith, Nicholay Topin
2018 arXiv   pre-print
One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate.  ...  The existence of super-convergence is relevant to understanding why deep networks generalize well.  ...  On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016. Diederik Kingma and Jimmy Ba.  ... 
arXiv:1708.07120v3 fatcat:ff5p24boonbzjm7qf4pt4vabqq
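
As a concrete reading of "one learning rate cycle and a large maximum learning rate", here is a minimal triangular one-cycle schedule; the exact warm-up/anneal shape and any final decay phase are assumptions rather than the paper's precise policy.

```python
# Sketch of a triangular one-cycle learning-rate schedule: linear warm-up to a
# large max LR over the first half of training, then linear anneal back down.
def one_cycle_lr(step, total_steps, max_lr=1.0, min_lr=0.01):
    half = total_steps // 2
    if step < half:
        return min_lr + (max_lr - min_lr) * step / half
    return max_lr - (max_lr - min_lr) * (step - half) / (total_steps - half)

print([round(one_cycle_lr(s, 10), 2) for s in range(10)])  # rises to 1.0, then falls
```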

Super-convergence: very fast training of neural networks using large learning rates

Leslie N. Smith, Nicholay Topin, Tien Pham
2019 Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications  
One of the key elements of super-convergence is training with cyclical learning rates and a large maximum learning rate.  ...  The existence of super-convergence is relevant to understanding why deep networks generalize well.  ...  Keskar et al. (2016) study the generalization gap between small and large mini-batches, stating that small mini-batch sizes lead to wide, flat minima and large batch sizes lead to sharp minima.  ... 
doi:10.1117/12.2520589 fatcat:jvkiuhrajrf2plx2sabrnu4zee

LRTuner: A Learning Rate Tuner for Deep Neural Networks [article]

Nikhil Iyer, V Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu
2021 arXiv   pre-print
One very important hyperparameter for training deep neural networks is the learning rate schedule of the optimizer.  ...  The kind of minima attained has a significant impact on the generalization accuracy of the network.  ...  On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016. Alex Krizhevsky, Geoffrey Hinton, et al.  ... 
arXiv:2105.14526v1 fatcat:nrr7i265kjhgjjunvfsrm6lvgq

Improving Generalization in Federated Learning by Seeking Flat Minima [article]

Debora Caldarola, Barbara Caputo, Marco Ciccone
2022 arXiv   pre-print
(ASAM) and ii) averaging stochastic weights (SWA) on the server side can substantially improve generalization in Federated Learning and help bridge the gap with centralized models.  ...  Motivated by prior studies connecting the sharpness of the loss surface and the generalization gap, we show that i) training clients locally with Sharpness-Aware Minimization (SAM) or its adaptive version  ...  Acknowledgments We thank Lidia Fantauzzo for her valuable help and support in running the semantic segmentation experiments.  ... 
arXiv:2203.11834v2 fatcat:tsj7gdaanrdspfo77qs655tjv4
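
Of the two server-side ingredients mentioned (SAM and SWA), weight averaging is the simpler to sketch: keep a running mean of the aggregated model across rounds. Treating the model as a flat vector and averaging every round are simplifying assumptions, not the paper's exact schedule.

```python
# Minimal running-average of model weights (SWA-style) across server rounds.
import numpy as np

class SWAAverager:
    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, weights):
        # Incremental running mean: avg <- avg + (w - avg) / n.
        self.count += 1
        if self.avg is None:
            self.avg = weights.copy()
        else:
            self.avg += (weights - self.avg) / self.count
        return self.avg

swa = SWAAverager()
for round_id in range(5):
    w = np.random.default_rng(round_id).normal(size=3)  # stand-in for aggregated weights
    print(swa.update(w))
```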

Not all noise is accounted equally: How differentially private learning benefits from large sampling rates [article]

Friedrich Dörmann, Osvald Frisk, Lars Nørvang Andersen, Christian Fischer Pedersen
2021 arXiv   pre-print
Learning often involves sensitive data and, as such, privacy-preserving extensions to Stochastic Gradient Descent (SGD) and other machine learning algorithms have been developed using the definitions of  ...  Given this observation, we propose a training paradigm that shifts the proportions of noise towards less inherent and more additive noise, such that more of the overall noise can be accounted for in the  ...  Concretely, it has been shown to act as regularization and enable the algorithm to "escape" from sharp, poorly generalizing minima and descend to "flat", well-generalizing minima, meaning that it serves as  ... 
arXiv:2110.06255v1 fatcat:yl6vfurnrngfvh2d45o632hbjy
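
The "additive noise" the snippet refers to is the standard differentially private SGD recipe of clipping per-example gradients and adding calibrated Gaussian noise. The sketch below shows that generic recipe only; the clip norm and noise multiplier are illustrative values, not the paper's settings or its proposed sampling-rate changes.

```python
# Schematic DP-SGD-style step: clip each per-example gradient to a bound C,
# average, then add Gaussian noise with std noise_multiplier * C / batch_size.
import numpy as np

def private_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped.append(g * scale)
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return mean_grad + noise

grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]   # toy per-example gradients
print(private_step(grads))
```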

Deep Bilevel Learning [chapter]

Simon Jenni, Paolo Favaro
2018 Lecture Notes in Computer Science  
Overfitting is controlled by introducing weights on each mini-batch in the training set and by choosing their values so that they minimize the error on the validation set.  ...  In practice, these weights define mini-batch learning rates in a gradient descent update equation that favor gradients with better generalization capabilities.  ... 
doi:10.1007/978-3-030-01249-6_38 fatcat:hy6tdc6cvzhqpih5qt7x3jqnra
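
One simple way to instantiate "mini-batch weights that favor gradients with better generalization" is to scale each mini-batch step by the alignment of its gradient with a gradient computed on held-out validation data. This is a hypothetical simplification for intuition, not the bilevel derivation in the paper; the helper and its parameters are made up.

```python
# Illustrative (hypothetical) weighted update: down-weight mini-batches whose
# gradients point away from the validation gradient.
import numpy as np

def weighted_step(w, batch_grad, val_grad, base_lr=0.1):
    align = float(batch_grad @ val_grad)
    align /= (np.linalg.norm(batch_grad) * np.linalg.norm(val_grad) + 1e-12)
    lr = base_lr * max(align, 0.0)          # cosine-alignment weight, clipped at 0
    return w - lr * batch_grad

w = np.array([1.0, 1.0])
w = weighted_step(w, batch_grad=np.array([0.5, 0.4]), val_grad=np.array([0.6, 0.3]))
print(w)
```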

Extrapolation for Large-batch Training in Deep Learning [article]

Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi
2020 arXiv   pre-print
A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap).  ...  Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training  ...  ., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016. Kleinberg, R., Li, Y., and Yuan, Y.  ... 
arXiv:2006.05720v1 fatcat:yz2d4stqrjgtbokc5eyjm6fati
Showing results 1 — 15 out of 1,050 results