How Does Batch Normalization Help Optimization? [article]

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry
2019 arXiv   pre-print
Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs).  ...  Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother.  ...  Acknowledgements We thank Ali Rahimi and Ben Recht for helpful comments on a preliminary version of this paper.  ... 
arXiv:1805.11604v5 fatcat:zj6ybdoo3rdbldv2idlshwwzj4

How Does Batch Normalization Help Binary Training? [article]

Eyyüb Sari, Mouloud Belbahri, Vahid Partovi Nia
2020 arXiv   pre-print
It appears in practice that BNNs fail to train in the absence of Batch Normalization (BatchNorm) layer. We find the main role of BatchNorm is to avoid exploding gradients in the case of BNNs.  ...  Most of binary training in convolutional models include Batch Normalization (BatchNorm) layer (Ioffe and Szegedy, 2015).  ...  It is natural to take a step back and wonder how important BatchNorm component is.  ... 
arXiv:1909.09139v3 fatcat:7qztdfjizrc4pnlgracihty5vq

Batch size-invariance for policy optimization [article]

Jacob Hilton, Karl Cobbe, John Schulman
2022 arXiv   pre-print
Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.  ...  However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates.  ...  Acknowledgments We thank David Farhi, Chris Hardin and Holly Mandel for helpful discussions and comments, and anonymous reviewers for helpful and detailed feedback.  ... 
arXiv:2110.00641v2 fatcat:ja2gypl55nhndeowjtnlgl3yji

Importance Sampled Stochastic Optimization for Variational Inference [article]

Joseph Sakaya, Arto Klami
2017 arXiv   pre-print
We show how the gradient with respect to the approximation parameters can often be evaluated efficiently without needing to re-compute gradients of the model itself, and then proceed to derive practical  ...  For all three choices the I-SGD algorithm with t = 0.9 converges to the same optimal solution as SGD, but does so in roughly an order of magnitude faster.  ...  A practical detail concerns the choice of how many steps to take for each mini-batch. This choice is governed by two aspects.  ... 
arXiv:1704.05786v2 fatcat:aglhxkqu2baslau5nsyf6tyfpa

Understanding Dropout as an Optimization Trick [article]

Sangchul Hahn, Heeyoul Choi
2019 arXiv   pre-print
First, we show that dropout can be explained as an optimization technique to push the input towards the saturation area of nonlinear activation function by accelerating gradient information flowing even  ...  In addition, GAAF works well with batch normalization, while dropout does not.  ...  This shows that GAAF works independently of batch normalization (maybe other optimization techniques too), while dropout hinders batch normalization (or other optimization techniques) by dropping out some  ... 
arXiv:1806.09783v3 fatcat:c3veoaljrjahhfgh6kaxlon7y4

Visual tolerance analysis for engineering optimization

W. Zhou Wei, M. Moore, F. Kussener
2013 International Journal of Metrology and Quality Engineering  
With the help of visual tolerance analysis, engineering and statistical analysts can work together to find the key factors responsible for propagating undesired variation into responses and how to reduce  ...  Classic methodologies of DOE are widely applied in design, manufacture, quality management and related fields.  ...  about how to optimize processes for robustness because it allows the modeling of more than main effects.  ... 
doi:10.1051/ijmqe/2013056 fatcat:2rlfpgusqva65nv6s6xhokijqa

An overview of gradient descent optimization algorithms [article]

Sebastian Ruder
2017 arXiv   pre-print
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.  ...  setting, and investigate additional strategies for optimizing gradient descent.  ...  Batch normalization [9] reestablishes these normalizations for every mini-batch and changes are backpropagated through the operation as well.  ... 
arXiv:1609.04747v2 fatcat:xobv3n2ljvfw5lrmn4ivlus6aa

Optimizer Amalgamation [article]

Tianshu Huang, Tianlong Chen, Sijia Liu, Shiyu Chang, Lisa Amini, Zhangyang Wang
2022 arXiv   pre-print
We are thus motivated to study a new problem named Optimizer Amalgamation: how can we best combine a pool of "teacher" optimizers into a single "student" optimizer that can have stronger problem-specific  ...  Many analytical optimizers have been proposed using a variety of theoretical and empirical approaches; however, none can offer a universal advantage over other competitive optimizers.  ...  However, we believe that input noise is generally not helpful to optimizer amalgamation, and did not study it further.  ... 
arXiv:2203.06474v2 fatcat:slbluz6vaneuzbitjasbeqilfq

Optimizing GPU Cache Policies for MI Workloads [article]

Johnathan Alsop, Matthew D. Sinclair, Srikant Bharadwaj, Alexandru Dutu, Anthony Gutierrez, Onur Kayiran, Michael LeBeane, Sooraj Puthoor, Xianwei Zhang, Tsung Tai Yeh, Bradford M. Beckmann
2019 arXiv   pre-print
Optimizing these workloads is important but complicated.  ...  Based on detailed simulation results, we motivate and evaluate a set of cache optimizations that consistently match the performance of the best static GPU caching policies.  ...  Local response normalization (LRN) and batch normalization (BN) are commonly used normalization layers.  ... 
arXiv:1910.00134v1 fatcat:xycfl3lhubc4rma6p6uw6v2zne

Bayesian Optimization with Gradients [article]

Jian Wu, Matthias Poloczek, Andrew Gordon Wilson, Peter I. Frazier
2018 arXiv   pre-print
However, unlike most optimization methods, Bayesian optimization typically does not use derivative information.  ...  In this paper we show how Bayesian optimization can exploit derivative information to decrease the number of objective function evaluations required for good performance.  ...  (2) The batch GP-UCB-PE method of Contal et al. [5] that does not utilize derivative information, and an extension that does.  ... 
arXiv:1703.04389v3 fatcat:tt7cy6v2hbh3vaz7ttvqmxfzgy

Hyperparameter Optimization with Differentiable Metafeatures [article]

Hadi S. Jomaa, Lars Schmidt-Thieme, Josif Grabocka
2021 arXiv   pre-print
Metafeatures, or dataset characteristics, have been shown to improve the performance of hyperparameter optimization (HPO).  ...  In contrast to existing models, DMFBS i) integrates a differentiable metafeature extractor and ii) is optimized using a novel multi-task loss, linking manifold regularization with a dataset similarity  ...  Experiments Our experiments are designed to answer two research questions: • Q1: Does meta-learning surrogates with end-to-end trainable metafeatures help generalize HPO on a new target dataset?  ... 
arXiv:2102.03776v1 fatcat:pbmuanqz6ngpho6soeegjoko4u

Data optimization for large batch distributed training of deep neural networks [article]

Shubhankar Gahlot, Junqi Yin, Mallikarjun Shankar
2020 arXiv   pre-print
We observe that the loss landscape minimization is shaped by both the model and training data and propose a data optimization approach that utilizes machine learning to implicitly smooth out the loss landscape  ...  Present solutions focus on improving message exchange efficiency as well as implementing techniques to tweak batch sizes and models in the training process.  ...  This low accuracy for smaller batch sizes applies to batch normalization computation and is explored further in [15].  ... 
arXiv:2012.09272v2 fatcat:hrgrznaeefefpkesxlgvuiycou

Learning to Optimize Domain Specific Normalization for Domain Generalization [article]

Seonguk Seo, Yumin Suh, Dongwan Kim, Geeho Kim, Jongwoo Han, Bohyung Han
2020 arXiv   pre-print
Specifically, we employ batch and instance normalizations in our implementation to identify the best combination of these two normalization methods in each domain.  ...  The optimized normalization layers are effective to enhance the generalizability of the learned model.  ...  In addition, IN does not depend on mini-batch construction or batch statistics, which can be helpful to extrapolate on unseen domains.  ... 
arXiv:1907.04275v3 fatcat:4lrn5su73fa7jjl47o52jnggre

Large-Scale Deep Learning Optimizations: A Comprehensive Survey [article]

Xiaoxin He, Fuzhao Xue, Xiaozhe Ren, Yang You
2021 arXiv   pre-print
We investigate algorithms that are most commonly used for optimizing, elaborate the debatable topic of generalization gap arises in large-batch training, and review the SOTA strategies in addressing the  ...  In this survey, we aim to provide a clear sketch about the optimizations for large-scale deep learning with regard to the model accuracy and model efficiency.  ...  [115] find the LR warm-up stage also helps quite a lot for other optimizers.  ... 
arXiv:2111.00856v2 fatcat:njjaygney5bqpo6ekyntgcjaiq

Optimal Classification of COVID-19: A Transfer Learning Approach

Aditya Kakde, Durgansh Sharma, Nitin Arora
2020 International Journal of Computer Applications  
This paper focuses on the classification which can help in analysis of COVID-19 with normal chest X-ray using deep learning technique.  ...  An optimal solution has been provided using transfer learning approach keeping in mind the limitation of the dataset.  ...  Batch Normalization Batch Normalization is the normalization of output of hidden layer. It is used to reduce the dependency of one hidden layer over the other hidden layer.  ... 
doi:10.5120/ijca2020920165 fatcat:imxgwadepfchljearconybt6mm