2,162 Hits in 2.8 sec

Batch Normalization Preconditioning for Neural Network Training [article]

Susanna Lange, Kyle Helfrich, Qiang Ye
2022 arXiv   pre-print
Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve generalization performance of neural networks.  ...  It is not suitable for use with very small mini-batch sizes or online learning. In this paper, we propose a new method called Batch Normalization Preconditioning (BNP).  ...  We would also like to thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and use of the Lipscomb Compute Cluster  ... 
arXiv:2108.01110v2 fatcat:qjopujyhujfvvh34pgxnxvefmm
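The entry above casts batch normalization as a form of preconditioning. For reference, a minimal NumPy sketch of the standard BN transform the paper builds on (my own illustration; this is not the proposed BNP method, whose details are not in the snippet):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standard batch normalization over a mini-batch.

    x: (batch, features) activations; gamma/beta: learnable scale/shift.
    Normalizing each feature to zero mean and unit variance acts like a
    diagonal preconditioner on the activations.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

The small-batch failure mode mentioned in the abstract follows directly from this sketch: with very few samples, `mu` and `var` become noisy estimates.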

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model [article]

Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse
2019 arXiv   pre-print
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns.  ...  We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with.  ...  used neural network training schedules.  ... 
arXiv:1907.04164v2 fatcat:7ys3ibw6vneenjtxtlzhokishy

Online Second Order Methods for Non-Convex Stochastic Optimizations [article]

Xi-Lin Li
2018 arXiv   pre-print
deep neural network learning.  ...  network architectures, e.g., convolutional and recurrent neural networks.  ...  ., step sizes, preconditioned gradient clipping thresholds, mini-batch sizes, neural network initial guesses, training and testing sample sizes, training loss smoothing factor, etc., can be found in our  ... 
arXiv:1803.09383v3 fatcat:j3bcbgerwrco5agmjprgtv26ma

Adaptive Gradient Methods at the Edge of Stability [article]

Jeremy M. Cohen and Behrooz Ghorbani and Shankar Krishnan and Naman Agarwal and Sourabh Medapati and Michal Badura and Daniel Suo and David Cardoze and Zachary Nado and George E. Dahl and Justin Gilmer
2022 arXiv   pre-print
For Adam with step size η and β_1 = 0.9, this stability threshold is 38/η. Similar effects occur during minibatch training, especially as the batch size grows.  ...  Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value – the stability threshold  ...  Most relevant to our paper, [31] conducted a qualitative study of full-batch Adam during neural network training.  ... 
arXiv:2207.14484v1 fatcat:rgqxzxgnsrh7raojtgajg4rtcu
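The 38/η threshold quoted above coincides with the classical heavy-ball stability bound 2(1 + β₁)/((1 − β₁)η) evaluated at β₁ = 0.9; the snippet does not spell out that derivation, so treat the identification as an assumption. A one-line check:

```python
def adam_stability_threshold(eta, beta1=0.9):
    """Heavy-ball edge-of-stability bound: the largest eigenvalue of the
    (preconditioned) Hessian that a momentum iteration with step size eta
    tolerates.  At beta1 = 0.9 it reduces to 38 / eta, matching the
    threshold quoted in the abstract.
    """
    return 2.0 * (1.0 + beta1) / ((1.0 - beta1) * eta)

threshold = adam_stability_threshold(eta=0.001)  # ~38000 for eta = 1e-3
```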

GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training [article]

Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-Yan Liu, Liwei Wang
2021 arXiv   pre-print
In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate the existing methods from other domains to GNNs.  ...  Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods.  ...  Effective training strategies for deep graph neural networks, 2020a. Zhou, K., Huang, X., Li, Y., Zha, D., Chen, R., and Hu, X.  ... 
arXiv:2009.03294v3 fatcat:meyz6ewugvcivfvmgxfdzuorbm

Adaptively Preconditioned Stochastic Gradient Langevin Dynamics [article]

Chandrasekaran Anirudh Bhardwaj
2019 arXiv   pre-print
Stochastic Gradient Langevin Dynamics injects isotropic gradient noise into SGD to help navigate pathological curvature in the loss landscape of deep networks.  ...  The isotropic nature of the noise leads to poor scaling, and adaptive methods based on higher-order curvature information, such as Fisher scoring, have been proposed to precondition the noise in order to achieve  ...  Dropout randomly drops neurons with some probability, mimicking training an ensemble of neural networks.  ... 
arXiv:1906.04324v2 fatcat:d6s5jypejzhlbnsk4m35mm5bsm
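The SGLD update the abstract describes can be sketched as follows, with an RMSProp-style diagonal preconditioner standing in for the paper's adaptive scheme (the stand-in and all parameter values are my assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def psgld_step(theta, grad, v, eta=1e-3, alpha=0.99, eps=1e-8):
    """One preconditioned SGLD step (sketch).

    Plain SGLD adds isotropic noise: theta -= eta*grad - sqrt(2*eta)*N(0, I).
    Here a diagonal preconditioner G scales both the gradient and the
    injected noise, so the noise follows local curvature instead of
    being isotropic.
    """
    v = alpha * v + (1 - alpha) * grad**2        # running second moment
    G = 1.0 / (np.sqrt(v) + eps)                 # diagonal preconditioner
    noise = rng.normal(size=theta.shape)
    theta = theta - eta * G * grad + np.sqrt(2 * eta * G) * noise
    return theta, v

theta, v = psgld_step(np.zeros(3), np.array([1.0, 2.0, 3.0]), np.zeros(3))
```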

Meta-Learning with Hessian-Free Approach in Deep Neural Nets Training [article]

Boyu Chen, Wenlian Lu, Ernest Fokoue
2018 arXiv   pre-print
Two recurrent neural networks are established to generate the damping and the preconditioning matrix of this Hessian-Free framework.  ...  Meta-learning is a promising approach to efficient training of deep neural nets and has been attracting increasing interest in recent years.  ...  In addition, the learning rate lr for training the target neural network is fixed to lr = b_tr / b_mt, where b_tr is the batch size in target-network training and b_mt is the batch size in meta-training  ... 
arXiv:1805.08462v2 fatcat:jtji5tikjfdknlla3y4rk3lxem

Scalable Natural Gradient Langevin Dynamics in Practice [article]

Henri Palacci, Henry Hess
2018 arXiv   pre-print
We compare different preconditioning approaches to the normalization of the noise vector and benchmark these approaches on the following criteria: 1) mixing times of the multivariate parameter vector,  ...  Stochastic Gradient Langevin Dynamics (SGLD) is a sampling scheme for Bayesian modeling adapted to large datasets and models.  ...  for large-scale neural networks.  ... 
arXiv:1806.02855v1 fatcat:onjqysti55fnvd2tsbx6xwwmma

GradNets: Dynamic Interpolation Between Neural Architectures [article]

Diogo Almeida, Nate Sauder
2015 arXiv   pre-print
Benefits include increased accuracy, easier convergence with more complex architectures, solutions to test-time execution of batch normalization, and the ability to train networks of up to 200 layers.  ...  Neural Networks, in particular, have enormous expressive power and yet are notoriously challenging to train. The nature of that optimization challenge changes over the course of learning.  ...  ACKNOWLEDGMENTS We thank NVIDIA for their generosity in providing access to part of their cluster in support of Enlitic's mission and our research.  ... 
arXiv:1511.06827v1 fatcat:kkpfwhtlt5anlkc3zitevgcxue

A Loss Curvature Perspective on Training Instability in Deep Learning [article]

Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat
2021 arXiv   pre-print
Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor  ...  Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization  ...  Introduction Optimization of neural networks can easily fail.  ... 
arXiv:2110.04369v1 fatcat:ml5q7fdbyjg3nke7hgdq3nxqt4

Neural-network preconditioners for solving the Dirac equation in lattice gauge theory [article]

Salvatore Calì, Daniel C. Hackett, Yin Lin, Phiala E. Shanahan, Brian Xiao
2022 arXiv   pre-print
This work develops neural-network-based preconditioners to accelerate solution of the Wilson-Dirac normal equation in lattice quantum field theories.  ...  In this system, neural-network preconditioners are found to accelerate the convergence of the conjugate gradient solver compared with the solution of unpreconditioned systems or those preconditioned with  ...  ACKNOWLEDGMENTS We thank William Detmold for useful comments on the manuscript. YL is grateful for the discussions with Andreas Kronfeld. SC, DCH, YL, and PES are sup-  ... 
arXiv:2208.02728v1 fatcat:s5pqklpxxvcphghnsnyskpklea
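The solver this entry accelerates is preconditioned conjugate gradient; a minimal sketch with a generic `apply_precond` callable standing in for the trained network (the Jacobi preconditioner in the example is my own stand-in for illustration):

```python
import numpy as np

def pcg(A, b, apply_precond, tol=1e-8, max_iter=100):
    """Preconditioned conjugate gradient for a symmetric positive
    definite system A x = b.

    apply_precond(r) approximates A^{-1} r; in the paper's setting that
    role is played by a trained neural network, here by any callable.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Example: Jacobi (diagonal) preconditioner standing in for the network.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = pcg(A, b, lambda r: r / np.diag(A))
```

A better preconditioner only changes `apply_precond`; the CG iteration itself is untouched, which is why a learned preconditioner can be dropped into an existing solver.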

Three Mechanisms of Weight Decay Regularization [article]

Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse
2018 arXiv   pre-print
Our results provide insight into how to improve the regularization of neural networks.  ...  Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation  ...  ACKNOWLEDGEMENT We thank Jimmy Ba, Kevin Luk, Maxime Gazeau, and Behnam Neyshabur for helpful discussions, and Tianqi Chen and Shengyang Sun for their feedback on early drafts.  ... 
arXiv:1810.12281v1 fatcat:l2zpoupsa5eqjlt5zi6p6cvpiq

Recurrent neural network training with preconditioned stochastic gradient descent [article]

Xi-Lin Li
2016 arXiv   pre-print
This paper studies the performance of a recently proposed preconditioned stochastic gradient descent (PSGD) algorithm on recurrent neural network (RNN) training.  ...  RNNs, especially the ones requiring extremely long term memories, are difficult to train.  ...  We have tested PSGD on eight pathological synthetic recurrent neural network (RNN) training problems.  ... 
arXiv:1606.04449v2 fatcat:r3r66yomynfmrgk5pdba3ytemm

Preconditioned Stochastic Gradient Descent

Xi-Lin Li
2018 IEEE Transactions on Neural Networks and Learning Systems  
network or a recurrent neural network requiring extremely long-term memories.  ...  Experimental results demonstrate that, equipped with the new preconditioner and without any tuning effort, preconditioned SGD can efficiently solve many challenging problems like the training of a deep neural  ...  However, the neural network trained by preconditioned SGD with preconditioner 3 using a large step size overfits the training data after about two epochs, and the neural network coefficients are pushed  ... 
doi:10.1109/tnnls.2017.2672978 pmid:28362591 fatcat:j3woq662tvfyfmdrjdoxjz65p4

Convolutional Neural Network Training with Distributed K-FAC [article]

J. Gregory Pauloski, Zhao Zhang, Lei Huang, Weijia Xu, Ian T. Foster
2020 arXiv   pre-print
We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale.  ...  Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales.  ...  For example, with the classic image classification problem, a batch size of 32K is considered large for convolutional neural network training with the ImageNet-1k dataset.  ... 
arXiv:2007.00784v1 fatcat:tacioznilvh7locxcqr6mejtt4
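K-FAC, the method this entry scales up, approximates each layer's Fisher block as a Kronecker product of two small covariance matrices, which makes inverting it cheap. A hedged sketch for one dense layer (variable names and the damping scheme are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def kfac_precondition(grad_W, a, g, damping=1e-2):
    """K-FAC preconditioning sketch for one dense layer's gradient.

    The layer's Fisher block is approximated as a Kronecker product
    A (x) S, where A is the input-activation covariance and S the
    covariance of backpropagated output gradients; applying its inverse
    to grad_W then reduces to two small matrix solves instead of one
    solve against the full (in*out x in*out) Fisher matrix.
    """
    batch = a.shape[0]
    A = a.T @ a / batch + damping * np.eye(a.shape[1])
    S = g.T @ g / batch + damping * np.eye(g.shape[1])
    # (A (x) S)^{-1} vec(grad_W) corresponds to A^{-1} grad_W S^{-1}
    return np.linalg.solve(A, grad_W) @ np.linalg.inv(S)

rng = np.random.default_rng(1)
a = rng.normal(size=(32, 4))        # input activations for a 4 -> 3 layer
g = rng.normal(size=(32, 3))        # backpropagated output gradients
grad_W = a.T @ g / 32               # example weight gradient
p = kfac_precondition(grad_W, a, g)
```

The cost argument in the abstract follows from the shapes: the two solves are against 4x4 and 3x3 matrices rather than a 12x12 Fisher block, and the gap widens rapidly with layer width.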
Showing results 1 — 15 out of 2,162 results