83,523 Hits in 3.1 sec

An Empirical Model of Large-Batch Training [article]

Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team
2018 arXiv   pre-print
In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency.  ...  Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.  ...  We would also like to thank Chris Berner, Chris Hesse, and Eric Sigler for their work on our training infrastructure.  ... 
arXiv:1812.06162v1 fatcat:ev7m3777lbfafjinfwitsv2p5u
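
This paper's predictions rest on the gradient noise scale, which the snippet only alludes to as the "empirically-motivated theory". Below is a minimal sketch of the "simple" noise scale B_simple = tr(Σ)/|G|² computed naively from per-example gradients; the paper itself uses a cheaper two-batch estimator, and the function name and availability of per-example gradients are assumptions.

```python
import numpy as np

def simple_noise_scale(per_example_grads: np.ndarray) -> float:
    """Naive estimate of the 'simple' gradient noise scale
    B_simple = tr(Sigma) / |G|^2 from per-example gradients of shape
    (num_examples, num_params). Larger values suggest that larger batch
    sizes can still be used data-efficiently."""
    g_mean = per_example_grads.mean(axis=0)        # estimate of the true gradient G
    g_var = per_example_grads.var(axis=0, ddof=1)  # per-parameter variance; sum = tr(Sigma)
    return float(g_var.sum() / (np.dot(g_mean, g_mean) + 1e-12))
```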

Train longer, generalize better: closing the generalization gap in large batch training of neural networks [article]

Elad Hoffer, Itay Hubara, Daniel Soudry
2018 arXiv   pre-print
We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization  ...  Deep learning models are typically trained using stochastic gradient descent or one of its variants.  ...
arXiv:1705.08741v2 fatcat:rx2j7sbljndfblvtqerco52jri
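
The snippet names "Ghost Batch Normalization" without detail; the sketch below is a minimal PyTorch-style reading of the idea, in which batch-norm statistics are computed over small virtual ("ghost") batches inside the large batch so normalization behaves as in small-batch training. The module name and the virtual_batch_size default are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GhostBatchNorm1d(nn.Module):
    """Illustrative Ghost Batch Normalization: normalize each small virtual
    batch independently so that BN statistics match small-batch training even
    when the actual batch is large."""
    def __init__(self, num_features: int, virtual_batch_size: int = 32):
        super().__init__()
        self.virtual_batch_size = virtual_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.split(self.virtual_batch_size, dim=0)   # virtual "ghost" batches
        return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)
```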

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent [article]

Noah Golmant and Nikita Vemuri and Zhewei Yao and Vladimir Feinberg and Amir Gholami and Kai Rothauge and Michael W. Mahoney and Joseph Gonzalez
2018 arXiv   pre-print
We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains  ...  We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break  ...  Indeed, an enormous amount of work has gone into designing systems that seem to operate under an assumption that equates large batch size training with machine learning at scale [9, 15, 24] .  ... 
arXiv:1811.12941v1 fatcat:7bsmkhwnpbcr3cocj7vehj3eda

Data optimization for large batch distributed training of deep neural networks [article]

Shubhankar Gahlot, Junqi Yin, Mallikarjun Shankar
2020 arXiv   pre-print
The current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale, and model accuracy deterioration with an increase in global  ...  Our approach filters out data points which are less important to feature learning, enabling us to speed up the training of models on larger batch sizes with improved accuracy.  ...  Tuning Horovod parameters for 96 GPUs yields only marginal improvements, and DDP demonstrates overall superior scaling for large batch distributed training of ResNet models.  ...
arXiv:2012.09272v2 fatcat:hrgrznaeefefpkesxlgvuiycou
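
The snippet says the approach "filters out data points which are less important to feature learning" but does not give the criterion. The sketch below shows one hypothetical loss-based filter purely to illustrate where such a step sits in the pipeline; the ranking rule and keep_fraction are assumptions, not the authors' method.

```python
import torch

def filter_informative_examples(losses: torch.Tensor, keep_fraction: float = 0.7):
    """Hypothetical filter: keep the indices of the examples with the largest
    current loss, on the assumption that low-loss points contribute least to
    further feature learning."""
    k = max(1, int(keep_fraction * losses.numel()))
    return torch.topk(losses, k).indices
```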

Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error [article]

Stanislav Fort, Andrew Brock, Razvan Pascanu, Soham De, Samuel L. Smith
2022 arXiv   pre-print
In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences model performance on held out data when training deep ResNets.  ...  We demonstrate drawing multiple samples per image consistently enhances the test accuracy achieved for both small and large batch training.  ...  Acknowledgements We thank Yee Whye Teh, Karen Simonyan, Zahra Ahmed and Hyunjik Kim for helpful advice, and Matthias Bauer for feedback on an earlier draft of the manuscript.  ... 
arXiv:2105.13343v2 fatcat:ce6zpqzqevbxzbefawecccm4oq
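
As a concrete illustration of "drawing multiple samples per image", the sketch below builds a batch containing several independent augmentations of each unique image; the helper name and the samples_per_image default are assumptions.

```python
import torch

def multi_sample_batch(images: torch.Tensor, augment, samples_per_image: int = 4):
    """Stack several independent stochastic augmentations of the same unique
    images into one batch. 'augment' is any transform mapping a batch of
    images to an equally shaped batch."""
    views = [augment(images) for _ in range(samples_per_image)]
    return torch.cat(views, dim=0)
```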

Stochastic Normalized Gradient Descent with Momentum for Large Batch Training [article]

Shen-Yi Zhao, Yin-Peng Xie, Wu-Jun Li
2020 arXiv   pre-print
Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop of generalization accuracy.  ...  Empirical results on deep learning also show that SNGM can achieve the state-of-the-art accuracy with a large batch size.  ...  ., 2016) for training the two models on CIFAR10 is using MSGD with a weight decay of 0.0001, a batch size of 128, an initial learning rate of 0.1, and dividing the learning rate at the 80th and 120th  ... 
arXiv:2007.13985v1 fatcat:dpfahtshefezjg4zjtzfkwmjlu
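
The snippet spells out the baseline MSGD recipe for the CIFAR10 models: batch size 128, initial learning rate 0.1, weight decay 0.0001, and the learning rate divided at the 80th and 120th epochs. A minimal PyTorch sketch of that recipe follows; the momentum value, the division factor of 10, and the stand-in model are assumptions not stated in the snippet.

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in for the CIFAR10 model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,             # initial learning rate 0.1
                            momentum=0.9,       # assumed momentum value
                            weight_decay=1e-4)  # weight decay 0.0001
# Divide the learning rate at the 80th and 120th epochs (factor of 10 assumed);
# the batch size of 128 would be set in the DataLoader.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
```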

Parameter Re-Initialization through Cyclical Batch Size Schedules [article]

Norman Mu and Zhewei Yao and Amir Gholami and Kurt Keutzer and Michael Mahoney
2018 arXiv   pre-print
We implement this through a cyclical batch size schedule motivated by a Bayesian perspective of neural network training.  ...  We demonstrate the ability of our method to improve language modeling performance by up to 7.91 perplexity and reduce training iterations by up to 61%, in addition to its flexibility in enabling snapshot  ...  In all language modeling CBS experiments, we use an initial batch size of 10, that is, half the baseline batch size as reported in the respective papers of each baseline model tested.  ... 
arXiv:1812.01216v1 fatcat:mkk5auxhunerdhd6u5zrrgbjmq
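
The snippet describes a cyclical batch size schedule starting at an initial batch size of 10 (half the baseline). The helper below is a hypothetical linear ramp-and-restart schedule just to make the idea concrete, since the actual cycle shape and length are not given in the snippet.

```python
def cyclical_batch_size(step: int, base_bs: int = 10, max_bs: int = 20,
                        cycle_len: int = 1000) -> int:
    """Hypothetical cyclical batch-size schedule: ramp linearly from the small
    initial batch size up to the baseline batch size over each cycle, then
    restart. Cycle length and shape are assumptions."""
    phase = (step % cycle_len) / cycle_len
    return int(round(base_bs + (max_bs - base_bs) * phase))
```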

Towards Efficient and Scalable Sharpness-Aware Minimization [article]

Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, Yang You
2022 arXiv   pre-print
Recently, Sharpness-Aware Minimization (SAM), which connects the geometry of the loss landscape and generalization, has demonstrated significant performance boosts on training large-scale models such as  ...  To further evaluate the performance and scalability of LookSAM, we incorporate a layer-wise modification and perform experiments in the large-batch training scenario, which is more prone to converge to  ...  Large-batch training is an important direction for distributed machine learning, which can improve the utilization of large-scale clusters and accelerate the training process.  ...
arXiv:2203.02714v1 fatcat:22qd4kyderbulb3mmevjubchuq
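
For readers unfamiliar with the SAM update this entry builds on, the sketch below shows one vanilla SAM step: ascend by rho along the normalized gradient, take the gradient at the perturbed weights, undo the perturbation, then apply the base optimizer. It is not LookSAM itself, and rho=0.05 is an assumed default.

```python
import torch

def sam_step(model, loss_fn, data, target, base_optimizer, rho: float = 0.05):
    """One basic SAM update (not the paper's LookSAM variant)."""
    # First forward/backward: gradient at the current weights.
    loss = loss_fn(model(data), target)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    # Ascent step toward the "sharpest" nearby point.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / grad_norm)
    model.zero_grad()
    # Second forward/backward: gradient at the perturbed weights.
    loss_fn(model(data), target).backward()
    # Undo the perturbation and step with the SAM gradient.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / grad_norm)
    base_optimizer.step()
    model.zero_grad()
    return loss.item()
```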

A Loss Curvature Perspective on Training Instability in Deep Learning [article]

Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat
2021 arXiv   pre-print
Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training  ...  Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid -- or navigate out of -- regions of high curvature and into flatter regions  ...  Instead, an empirically corrected threshold 40/η seems to fit the data better.  ... 
arXiv:2110.04369v1 fatcat:ml5q7fdbyjg3nke7hgdq3nxqt4
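
The "40/η" in the snippet is a correction to the classical stability condition for gradient descent on a quadratic model, which bounds the largest Hessian eigenvalue by 2/η. Stated side by side:

```latex
% Classical stability bound for gradient descent on a quadratic model,
% versus the empirically corrected threshold reported in the snippet:
\lambda_{\max}(H) \le \frac{2}{\eta} \quad \text{(classical)},
\qquad
\lambda_{\max}(H) \lesssim \frac{40}{\eta} \quad \text{(empirically corrected)}.
```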

An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation [article]

Makoto Morishita, Yusuke Oda, Graham Neubig, Koichiro Yoshino, Katsuhito Sudoh, Satoshi Nakamura
2017 arXiv   pre-print
Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes.  ...  However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated  ...  Acknowledgments This work was done as a part of the joint research project with NTT and Nara Institute of Science and Technology.  ... 
arXiv:1706.05765v1 fatcat:rcp6u4vsibah3kps22b76xuddu
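
As one example of the "disparate strategies" this paper compares (an illustration only, not necessarily the strategy the authors recommend), the sketch below forms mini-batches after sorting sentence pairs by target length so that padding inside each batch is minimized.

```python
def length_sorted_batches(sentence_pairs, batch_size: int = 64):
    """Group (source, target) sentence pairs into mini-batches of similar
    target length, one common mini-batch creation strategy in NMT toolkits."""
    ordered = sorted(sentence_pairs, key=lambda pair: len(pair[1]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```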

An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation

Makoto Morishita, Yusuke Oda, Graham Neubig, Koichiro Yoshino, Katsuhito Sudoh, Satoshi Nakamura
2017 Proceedings of the First Workshop on Neural Machine Translation  
Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes.  ...  However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated  ...  Acknowledgments This work was done as a part of the joint research project with NTT and Nara Institute of Science and Technology.  ... 
doi:10.18653/v1/w17-3208 dblp:conf/aclnmt/MorishitaONYSN17 fatcat:qfczkqx2e5eodhvlydm6527ovi

DC-MMD-GAN: A New Maximum Mean Discrepancy Generative Adversarial Network Using Divide and Conquer

Zhaokun Zhou, Yuanhong Zhong, Xiaoming Liu, Qiang Li, Shu Han
2020 Applied Sciences  
We propose an efficient divide-and-conquer model, called DC-MMD-GANs, which constrains the loss function of MMD to a tight bound on the deviation between the empirical estimate and the expected value of MMD and  ...  However, the loss function of MMD-GANs is an empirical estimate of maximum mean discrepancy (MMD) and is not precise in measuring the distance between sample distributions, which inhibits the training of MMD-GANs  ...  We find that the B-test [31] can obtain a more precise empirical estimate of MMD by computing an average over empirical estimates calculated on subsets.  ...
doi:10.3390/app10186405 fatcat:jyra3w3lkzhodhfuq45arrfn2u
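
The entry repeatedly refers to the "empirical estimate of MMD"; for reference, the sketch below computes the standard biased estimate of squared MMD under a Gaussian kernel (the bandwidth sigma is an assumption). The B-test mentioned in the snippet would average such estimates over disjoint subsets of the samples.

```python
import numpy as np

def mmd2_biased(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased empirical estimate of squared MMD between samples X (n, d) and
    Y (m, d) under a Gaussian kernel with bandwidth sigma."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())
```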

Curriculum Adversarial Training

Qi-Zhi Cai, Chang Liu, Dawn Song
2018 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence  
The state-of-the-art result on defense shows that adversarial training can be applied to train a robust model on MNIST against adversarial examples; but it fails to achieve a high empirical worst-case  ...  With two techniques to mitigate the catastrophic forgetting and the generalization issues, we demonstrate that CAT can improve the prior art's empirical worst-case accuracy by a large margin of 25% on  ...  ., 2018] provide two insights on why previous adversarial training approaches cannot train a robust model: (1) the model should have a sufficiently large capacity; and (2) strong attacks should be used  ... 
doi:10.24963/ijcai.2018/520 dblp:conf/ijcai/CaiLS18 fatcat:ceytvte6mvftplgxixmccgfeme
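
The snippet does not show the curriculum itself; below is a hypothetical schedule in the spirit of the title, gradually increasing the strength of the attack used to generate adversarial training examples. The PGD-step parameterization and pacing are assumptions, not the paper's exact schedule.

```python
def attack_strength(epoch: int, max_pgd_steps: int = 10, epochs_per_level: int = 5) -> int:
    """Hypothetical curriculum over attack strength: start adversarial
    training with a 1-step attack and add a PGD step every few epochs,
    capped at max_pgd_steps."""
    return min(max_pgd_steps, 1 + epoch // epochs_per_level)
```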

An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise [article]

Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, Jimmy Ba
2020 arXiv   pre-print
Our empirical studies with standard deep learning model architectures and datasets show that our method not only improves generalization performance in large-batch training but, furthermore, does so in  ...  To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients.  ...  Secondly, from Fig. 3, we find empirically that the training dynamics of a large-batch regime with the empirical Fisher are very close to those of a small-batch regime (which by the above analysis should be captured  ...
arXiv:1902.08234v4 fatcat:656pntkmmnhcldlmtgzxktoniy
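
The paper's contribution is the structure of the noise covariance; the sketch below only marks where such noise would enter a large-batch update, using diagonal Gaussian noise as a stand-in for the structured covariance described in the abstract. The scale and placement of the noise are assumptions.

```python
import torch

def noisy_sgd_step(params, lr: float = 0.1, noise_std: float = 1e-3):
    """Add zero-mean Gaussian noise to each gradient before the SGD update.
    The paper adds noise with a *structured* covariance; the diagonal noise
    here is only a placeholder showing where the perturbation is applied."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            p.grad.add_(noise_std * torch.randn_like(p.grad))
            p.add_(-lr * p.grad)
```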

Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio [chapter]

Stanislaw Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
2018 Lecture Notes in Computer Science  
We show that the dynamics and convergence properties of SGD are set by the ratio of learning rate to batch size.  ...  We observe that this ratio is a key determinant of the generalization error, which we suggest is mediated by controlling the width of the final minima found by SGD.  ...  was run for 200 epochs in which most models reached an accuracy of almost 100% on the training set.  ... 
doi:10.1007/978-3-030-01424-7_39 fatcat:q6sk2gbltnhkli2gc4qwujf25i
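
To make the snippet's claim concrete: two configurations with the same learning-rate-to-batch-size ratio should, on this account, exhibit similar SGD dynamics and end up in minima of similar width. The particular numbers below are illustrative assumptions.

```python
# Two hypothetical configurations sharing the same lr / batch_size ratio,
# which the entry above argues is the key determinant of SGD's dynamics.
config_a = {"lr": 0.1, "batch_size": 128}
config_b = {"lr": 0.4, "batch_size": 512}
for cfg in (config_a, config_b):
    print(cfg, "lr/B =", cfg["lr"] / cfg["batch_size"])
```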
Showing results 1 — 15 out of 83,523 results