97,563 Hits in 3.7 sec

Understanding and Scheduling Weight Decay [article]

Zeke Xie, Issei Sato, Masashi Sugiyama
2021 arXiv   pre-print
Third, we provide an effective learning-rate-aware scheduler for weight decay, called the Stable Weight Decay (SWD) method, which, to the best of our knowledge, is the first practical design for weight  ...  Weight decay is a popular and even necessary regularization technique for training deep neural networks that generalize well.  ...  Stable/Decoupled Weight Decay often outperforms L2 regularization for optimizers involving momentum.  ...
arXiv:2011.11152v4 fatcat:gbuwvxetvnbb5cpa6hfpwsh34u
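The distinction the snippet draws between decoupled weight decay and L2 regularization only matters once momentum is involved; the following minimal NumPy sketch illustrates it (the toy quadratic loss and all hyperparameters are illustrative assumptions, not the paper's SWD scheduler):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, mu=0.9, l2=0.0, decoupled_wd=0.0):
    """One SGD-with-momentum step.

    l2:           L2 regularization -- lambda * w is added to the loss gradient, so the
                  penalty passes through the momentum buffer like any other gradient term.
    decoupled_wd: decoupled weight decay -- the weights are shrunk directly by lr * lambda * w,
                  bypassing the momentum buffer (AdamW-style decoupling).
    """
    g = grad + l2 * w                 # L2 couples the penalty into the gradient
    v = mu * v + g                    # momentum accumulation
    w = w - lr * v                    # gradient/momentum update
    w = w - lr * decoupled_wd * w     # decoupled decay acts on the weights themselves
    return w, v

# toy quadratic loss 0.5 * ||w - 1||^2 with gradient (w - 1)
w_l2, v_l2 = np.full(3, 5.0), np.zeros(3)
w_dw, v_dw = np.full(3, 5.0), np.zeros(3)
for _ in range(200):
    w_l2, v_l2 = sgd_momentum_step(w_l2, v_l2, w_l2 - 1.0, lr=0.1, l2=1e-2)
    w_dw, v_dw = sgd_momentum_step(w_dw, v_dw, w_dw - 1.0, lr=0.1, decoupled_wd=1e-2)
print(w_l2, w_dw)  # with momentum the two variants settle at different points
```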

Page 478 of Neural Computation Vol. 8, Issue 3 [page]

1996 Neural Computation  
The smoothing regularizer yields a symmetric α-stable (or leptokurtic) distribution of weights (large peak near zero and long tails), whereas the quadratic weight decay produces a distribution that is  ...  trained with our smoothing regularizer and those with standard weight decay.  ...

β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search [article]

Peng Ye, Baopu Li, Yikang Li, Tao Chen, Jiayuan Fan, Wanli Ouyang
2022 arXiv   pre-print
To solve these two problems, a simple-but-efficient regularization method, termed Beta-Decay, is proposed to regularize the DARTS-based NAS searching process.  ...  Specifically, Beta-Decay regularization can impose constraints to keep the value and variance of activated architecture parameters from becoming too large.  ...  As shown in Fig. 2, DARTS with L2 or weight decay regularization suffers from the performance collapse issue, while DARTS with Beta-Decay regularization has a stable search process.  ...
arXiv:2203.01665v2 fatcat:3q3daee5lvhs5elcvu6vujzb3e
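The constraint the snippet describes acts on the activated (softmax) architecture parameters rather than on the raw α, as plain L2/weight decay would. The sketch below illustrates that general idea only; the specific penalty and the tiny parameter tensor are illustrative assumptions, not the exact Beta-Decay formulation of the paper:

```python
import torch

# hypothetical DARTS-style architecture parameters: 4 edges x 5 candidate operations
alpha = torch.randn(4, 5, requires_grad=True)

def beta_penalty(alpha):
    """Illustrative stand-in: regularize the activated parameters beta = softmax(alpha).
    Minimizing ||beta||^2 pulls each row of beta toward the uniform distribution, keeping
    both the magnitude and the variance of the activated parameters from growing too large."""
    beta = torch.softmax(alpha, dim=-1)
    return (beta ** 2).sum()

lam = 1e-2                          # regularization strength (assumed)
reg = lam * beta_penalty(alpha)     # would be added to the DARTS validation loss
reg.backward()
print(alpha.grad.abs().max())       # gradients flow through the softmax, unlike plain L2 on alpha
```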

Tangent-Space Regularization for Neural-Network Models of Dynamical Systems [article]

Fredrik Bagge Carlson, Rolf Johansson, Anders Robertsson
2018 arXiv   pre-print
Furthermore, the influence of L_2 weight regularization on the learned Jacobian eigenvalue spectrum, and hence system stability, is investigated.  ...  This work introduces the concept of tangent space regularization for neural-network models of dynamical systems.  ...  The first examples demonstrate the effectiveness of tangent-space regularization, whereas later examples demonstrate the influence of weight decay.  ...
arXiv:1806.09919v1 fatcat:z37e64da2jbw7cdhaoniu6snee
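The Jacobian eigenvalue spectrum mentioned in the abstract can be inspected directly for a one-step model x_{t+1} = f(x_t); the tiny network, state dimension, and linearization point below are illustrative assumptions:

```python
import torch

# hypothetical discrete-time dynamics model x_{t+1} = f(x_t)
f = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 4),
)

x0 = torch.zeros(4)                                  # linearization point (e.g. an equilibrium)
J = torch.autograd.functional.jacobian(f, x0)        # 4x4 Jacobian df/dx at x0
print(torch.linalg.eigvals(J).abs())                 # local stability of the model needs all |eig| < 1

# Shrinking the weights (e.g. via L_2 weight decay) shrinks the layer matrices that J is
# built from, which is one mechanism by which weight decay pulls this spectrum inward.
```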

Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks

Chi-Sing Leung, Ah-Chung Tsoi, Lai Wan Chan
2001 IEEE Transactions on Neural Networks  
weight decay effect as training progresses.  ...  Though the standard RLS algorithm has an implicit weight decay term in its energy function, the weight decay effect decreases linearly as the number of learning epochs increases, thus rendering a diminishing  ...  Background on Weight Decay In the standard weight decay method, there is a quadratic regularization term in the cost function [15], [20], given by (1) where the regularization constant is a positive  ...
doi:10.1109/72.963768 pmid:18249961 fatcat:ntal4y42fvcvlfaef2vmwv7grq
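The quadratic regularization term the snippet cites (its equation (1) is not reproduced in the excerpt) conventionally has the form below; the notation is the generic textbook one and not necessarily the paper's:

```latex
% generic weight-decay-regularized cost and its gradient
J(\mathbf{w}) \;=\; E(\mathbf{w}) \;+\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2},
\qquad
\nabla_{\mathbf{w}} J \;=\; \nabla_{\mathbf{w}} E \;+\; \lambda\,\mathbf{w},
\qquad \lambda > 0
```

Each gradient step therefore contains a multiplicative shrinkage term −ηλw on the weights, which is the "weight decay effect" whose diminishing strength under standard RLS motivates the two proposed regularizers.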

Understanding the Disharmony between Weight Normalization Family and Weight Decay: ϵ-shifted L_2 Regularizer [article]

Li Xiang, Chen Shuo, Xia Yan, Yang Jian
2019 arXiv   pre-print
Surprisingly, W must be decayed during gradient descent, otherwise we will observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks  ...  Furthermore, we also expose several critical problems when introducing weight decay term to weight normalization family, including the missing of global minimum and training instability.  ...  The central reason is that weight decay helps to control the effective learning rate in a stable and reasonable range.  ... 
arXiv:1911.05920v1 fatcat:frl43x35jrha7p2hloa2wr5wvi

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization [article]

Ziquan Liu, Yufei Cui, Jia Wan, Yu Mao, Antoni B. Chan
2022 arXiv   pre-print
To address those weaknesses, we propose to regularize the weight norm using a simple yet effective weight rescaling (WRS) scheme as an alternative to weight decay.  ...  decay, implicit weight rescaling (weight standardization) and gradient projection (AdamP).  ...  Weight Decay Regularization and BatchNorm Several works have studied the effects of weight decay regularization and its effect on BatchNorm DNNs.  ... 
arXiv:2102.03497v2 fatcat:3cll3hd7cbctlp5yqaepqhyan4
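A minimal sketch of the weight-rescaling idea, i.e. constraining the norm of weights that feed into BatchNorm by rescaling them instead of penalizing them; the per-filter treatment, target norm, and rescaling interval are assumptions and not necessarily the paper's exact WRS scheme:

```python
import torch

@torch.no_grad()
def rescale_conv_weights(model, target_norm=1.0):
    """Rescale each conv filter to a fixed norm.  For a layer followed by BatchNorm the
    forward pass is invariant to this rescaling, so the step only resets the effective
    learning rate (which otherwise grows as the weight norm shrinks, and vice versa)."""
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            norms = m.weight.flatten(1).norm(dim=1).clamp_min(1e-12)  # one norm per output filter
            m.weight.mul_(target_norm / norms.view(-1, 1, 1, 1))

# e.g. call rescale_conv_weights(model) every few hundred optimizer steps instead of using weight decay
```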

A Smoothing Regularizer for Feedforward and Recurrent Neural Networks

Lizhong Wu, John Moody
1996 Neural Computation  
Empirical results show that the smoothing regularizer yields a real symmetric α-stable (SαS) weight distribution, whereas standard quadratic weight decay produces a normal distribution.  ...  The smoothing regularizer yields a symmetric α-stable (or leptokurtic) distribution of weights (large peak near zero and long tails), whereas the quadratic weight decay produces a distribution that is  ...
doi:10.1162/neco.1996.8.3.461 fatcat:bqwqnucf2ndfndkb5gvjaevof4
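The distributional claim (leptokurtic vs. Gaussian weight histograms) is easy to check on trained weights via excess kurtosis; the synthetic samples below only stand in for weights trained with the two regularizers, and the heavy-tailed Student-t is a stand-in for an SαS law:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
w_quadratic = rng.normal(scale=0.1, size=10_000)        # Gaussian-like, as quadratic weight decay tends to give
w_smoothing = rng.standard_t(df=3, size=10_000) * 0.05  # heavy-tailed stand-in: sharp peak near zero, long tails

# Fisher excess kurtosis: ~0 for a normal distribution, clearly > 0 for a leptokurtic one
print(kurtosis(w_quadratic), kurtosis(w_smoothing))
```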

Stable reduction to the pole at the magnetic equator

Yaoguo Li, Douglas W. Oldenburg
2001 Geophysics  
The applied regularization alleviates the singularity associated with the wavenumber-domain RTP operator, and the imposed power spectral decay ensures that the constructed RTP field has the correct spectral  ...  We develop a solution to this problem that allows stable reconstruction of the RTP field with a high fidelity even at the magnetic equator.  ...  It is therefore necessary to incorporate the knowledge about the spectral decay through the use of the weighting function.  ... 
doi:10.1190/1.1444948 fatcat:dpa7uisnnfdmxlkgbwi6yq5xdm
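The stabilization described, damping a wavenumber-domain operator that becomes singular, is in the spirit of Tikhonov-damped spectral division; the sketch below shows only that generic idea with a made-up operator H, not the paper's actual RTP operator or its power-spectral-decay constraint:

```python
import numpy as np

def damped_spectral_division(data, H, eps=1e-2):
    """Invert a wavenumber-domain operator H with Tikhonov damping:
    F_out = F_in * conj(H) / (|H|^2 + eps), which stays finite where |H| -> 0."""
    F = np.fft.fft2(data)
    return np.real(np.fft.ifft2(F * np.conj(H) / (np.abs(H) ** 2 + eps)))

# toy usage with an operator that vanishes at the origin of wavenumber space,
# mimicking the kind of singularity that makes naive spectral division unstable
ny, nx = 64, 64
kx = np.fft.fftfreq(nx)[None, :]
ky = np.fft.fftfreq(ny)[:, None]
H = kx + 1j * ky                                       # purely illustrative operator
data = np.random.default_rng(1).normal(size=(ny, nx))
stabilized = damped_spectral_division(data, H)
```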

Understanding the Disharmony between Weight Normalization Family and Weight Decay

Xiang Li, Shuo Chen, Jian Yang
2020 Proceedings of the AAAI Conference on Artificial Intelligence
Surprisingly, W must be decayed during gradient descent, otherwise we will observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks  ...  Moreover, if we substitute (e.g., weight normalization) W′ = W/∥W∥ in the original loss function ∑_i L(f(x_i; W′), y_i) + ½λ∥W′∥², it is observed that the regularization term ½λ∥W′∥² will be canceled as a constant  ...  The central reason is that weight decay helps to control the effective learning rate in a stable and reasonable range.  ...
doi:10.1609/aaai.v34i04.5904 fatcat:f6t3x7jfo5bs5dm4cmsmygdfre
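Both observations in the abstract, that ½λ∥W′∥² becomes a constant once W′ = W/∥W∥ and that decaying W controls the effective learning rate, can be verified numerically; the tiny linear model and data below are illustrative assumptions:

```python
import torch

x, y = torch.randn(8, 10), torch.randn(8, 1)

def loss_with_wn(W):
    W_unit = W / W.norm()                   # weight normalization: only W's direction matters,
    return ((x @ W_unit - y) ** 2).mean()   # so 0.5 * lam * ||W_unit||^2 == 0.5 * lam, a constant

for scale in (1.0, 10.0):
    W = (scale * torch.ones(10, 1)).requires_grad_()
    loss_with_wn(W).backward()
    # the gradient norm shrinks roughly as 1/||W||: the larger the weight norm, the smaller the
    # effective step, which is why decaying W keeps the effective learning rate in a sane range
    print(scale, W.grad.norm().item())
```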

Surprising Instabilities in Training Deep Networks and a Theoretical Analysis [article]

Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, Anthony Yezzi
2022 arXiv   pre-print
We show that it is stable only under certain conditions on the learning rate and weight decay.  ...  localized over iterations and regions of the weight tensor space.  ...  However, in the case that a < 0, the weight decay must be chosen large enough to be stable.  ...
arXiv:2206.02001v1 fatcat:wcfw6tjyubgthe2dlqpqhmm7na
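A toy scalar version of the kind of condition the abstract refers to (not the paper's actual analysis): for the update w ← w − η(a·w + λ·w) the iteration is stable iff |1 − η(a + λ)| < 1, so when a < 0 the weight decay λ must be chosen large enough (λ > −a, with η small enough):

```python
def is_stable(a, lam, lr):
    """Scalar linear update w <- w - lr*(a*w + lam*w); stable iff |1 - lr*(a + lam)| < 1."""
    return abs(1.0 - lr * (a + lam)) < 1.0

a, lr = -0.5, 0.1                          # negative 'curvature' term, the a < 0 case in the snippet
for lam in (0.0, 0.4, 0.6):
    w = 1.0
    for _ in range(200):
        w -= lr * (a * w + lam * w)
    print(lam, is_stable(a, lam, lr), w)   # lam < -a: the iterate grows; lam > -a: it decays toward 0
```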

Page 2116 of The Journal of Neuroscience Vol. 16, Issue 6 [page]

1996 The Journal of Neuroscience  
rule with weight decay (see Appendix 2).  ...  C, When the weight regularization is too strong, the actual stable firing profile tends to be blunter than the desired one, or even becomes totally flat (not shown).  ... 

How I Learned to Stop Worrying and Love Retraining [article]

Max Zimmer, Christoph Spiegel, Sebastian Pokutta
2022 arXiv   pre-print
weight decay.  ...  For the retraining phase we deactivate weight decay. DPF: As for GMP, we tune the number of pruning steps, i.e., {20, 100}, and the weight decay.  ...  Secondly, we denote the time needed when compared to regular training of a dense model, e.g. LC needs 1.14 times as much runtime as regular training.  ...
arXiv:2111.00843v2 fatcat:gyet2ak2mrhuzgqzoqymva7uaa
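Deactivating weight decay for a retraining phase, as the snippet describes, amounts to zeroing it in every optimizer parameter group; the model and hyperparameters below are placeholders:

```python
import torch

model = torch.nn.Linear(128, 10)   # placeholder for the pruned network
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# ... prune, then before retraining switch weight decay off for all parameter groups
for group in opt.param_groups:
    group["weight_decay"] = 0.0
```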

Understanding the Role of Training Regimes in Continual Learning [article]

Seyed Iman Mirzadeh, Mehrdad Farajtabar, Razvan Pascanu, Hassan Ghasemzadeh
2020 arXiv   pre-print
However, there has been limited prior work extensively analyzing the impact that different training regimes -- learning rate, batch size, regularization method -- can have on forgetting.  ...  In particular, we study the effect of dropout, learning rate decay, and batch size, on forming training regimes that widen the tasks' local minima and, consequently, on helping it not to forget catastrophically  ...  Regularization: dropout and weight decay We relate the theoretical insights on dropout and L2 regularization (weight decay) to our analysis in the previous section.  ...
arXiv:2006.06958v1 fatcat:kq545vj3brf6nchbxjo3rlwnb4

Stabilization of the inverse Laplace transform of multiexponential decay through introduction of a second dimension

Hasan Celik, Mustapha Bouhrara, David A. Reiter, Kenneth W. Fishbein, Richard G. Spencer
2013 Journal of Magnetic Resonance
We propose a new approach to stabilizing the inverse Laplace transform of a multiexponential decay signal, a classically ill-posed problem, in the context of nuclear magnetic resonance relaxometry.  ...  We find markedly improved accuracy, and stability with respect to noise, as well as insensitivity to regularization in quantifying underlying relaxation components through use of the two-dimensional as  ...  We find markedly improved stability, accuracy, and insensitivity to regularization. Celik et al. Page 8  ... 
doi:10.1016/j.jmr.2013.07.008 pmid:24035004 pmcid:PMC3818505 fatcat:wrrwuxmdbfhpjkyae6sjjrkno4
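For context, a one-dimensional regularized inversion of a multiexponential decay is typically posed as a Tikhonov-damped non-negative least-squares problem; the grid, noise level, and λ below are arbitrary choices, and this is the classical 1D formulation rather than the paper's two-dimensional extension:

```python
import numpy as np
from scipy.optimize import nnls

# synthetic two-component multiexponential decay
t = np.linspace(0.001, 1.0, 128)              # acquisition times (s)
T2 = np.logspace(-3, 0, 100)                  # relaxation-time grid (s)
A = np.exp(-t[:, None] / T2[None, :])         # kernel: signal = A @ spectrum
true = np.zeros(T2.size); true[30], true[70] = 0.6, 0.4
y = A @ true + np.random.default_rng(0).normal(scale=0.01, size=t.size)

# min ||A x - y||^2 + lam^2 ||x||^2  subject to  x >= 0,
# solved by stacking lam * I under A and zeros under y (classic regularized NNLS)
lam = 0.1
A_aug = np.vstack([A, lam * np.eye(T2.size)])
y_aug = np.concatenate([y, np.zeros(T2.size)])
spectrum, _ = nnls(A_aug, y_aug)              # ill-posed without damping; lam trades bias for stability
```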
Showing results 1 — 15 out of 97,563 results