5,168 Hits in 12.0 sec

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks [article]

Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen
2020 arXiv   pre-print
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay.  ...  In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam or  ...  Conclusion: We propose NovoGrad, an adaptive SGD method with gradients normalized by the layer-wise second moment and with decoupled weight decay.  ... 
arXiv:1905.11286v3 fatcat:mbibr3v3ebbenedhwsxzokbcwi
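
The abstract above describes the NovoGrad update in words: gradients normalized by a layer-wise second moment, plus decoupled weight decay. The Python sketch below is a minimal reading of that description, not the authors' reference implementation; the hyperparameter values, the scalar-per-layer second moment, and the dictionary-of-layers interface are assumptions made for illustration.

```python
import numpy as np

def novograd_like_step(params, grads, states, lr=0.01, beta1=0.95,
                       beta2=0.98, eps=1e-8, weight_decay=1e-3):
    """One optimizer step per layer: layer-wise second-moment normalization
    plus decoupled weight decay. Illustrative sketch only."""
    for name, w in params.items():
        g = grads[name]
        m, v = states.setdefault(name, (np.zeros_like(w), 0.0))
        # Layer-wise second moment: a single scalar per layer, not per element.
        v = beta2 * v + (1.0 - beta2) * float(np.sum(g * g))
        # Normalize the layer's gradient and add decoupled weight decay.
        update = g / (np.sqrt(v) + eps) + weight_decay * w
        # First moment (momentum) accumulated on the normalized update.
        m = beta1 * m + update
        params[name] = w - lr * m
        states[name] = (m, v)
    return params, states
```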

Handwritten Devanagari Character Recognition Using Layer-Wise Training of Deep Convolutional Neural Networks and Adaptive Gradient Methods

Mahesh Jangid, Sumit Srivastava
2018 Journal of Imaging  
the use of six recently developed adaptive gradient methods.  ...  The results of layer-wise-trained DCNN are favorable in comparison with those achieved by a shallow technique of handcrafted features and standard DCNN.  ...  gradients) [5], SIFT (scale-invariant feature transform  ...  was not possible for the handwritten Devanagari characters.  ... 
doi:10.3390/jimaging4020041 fatcat:d7syngifw5ajbn34lag3hab77y

Reinforced stochastic gradient descent for deep neural network learning [article]

Haiping Huang, Taro Toyoizumi
2017 arXiv   pre-print
Stochastic gradient descent (SGD) is a standard optimization method to minimize a training error with respect to network parameters in modern neural network learning.  ...  abilities of deep networks.  ...  Acknowledgments: We are grateful to the anonymous referee for many constructive comments. H.H. thanks Dr. Alireza Goudarzi for a lunch discussion which later triggered the idea of this work.  ... 
arXiv:1701.07974v5 fatcat:lja5itgbcjfh5p7q53hsrnoonm
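
The first fragment above restates the vanilla SGD baseline that the paper builds on. For reference, a minimal NumPy illustration of that baseline update (this is the standard SGD step, not the reinforced variant proposed in the paper):

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Vanilla SGD: move the parameters against the gradient of the training error."""
    return w - lr * grad

# Toy example: one step on f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0])
w = sgd_step(w, grad=w, lr=0.1)   # -> [0.9, -1.8]
```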

BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model

Junjie Liu, Dongchao Wen, Deyu Wang, Wei Tao, Tse-Wei Chen, Kinya Osa, Masami Kato
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)  
constraining the range of gradients is critical for optimizing the deep binary model to avoid highly suboptimal solutions.  ...  In this paper, we provide an explicit convex optimization example where training BNNs with traditional adaptive optimization methods still faces the risk of non-convergence, and identify that  ...  For example, the methods [52, 35] built on the idea of blended gradients exhibit a significant improvement for training BNNs.  ... 
doi:10.1109/cvprw50498.2020.00345 dblp:conf/cvpr/LiuWWTCOK20 fatcat:bxqam3csm5g47nuxw3u3zk3wwu
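
The entry argues that constraining the range of gradients is critical when optimizing deep binary models. The sketch below is only a hedged illustration of that general idea, not the BAMSProd algorithm: an Adam-style update on the latent real-valued weights of a BNN whose per-element step is clipped to a bounded range. The `max_step` bound, the moment constants, and the sign-based binarization are assumptions for illustration.

```python
import numpy as np

def clipped_adaptive_update(w_real, grad, state, lr=1e-3, beta1=0.9,
                            beta2=0.999, eps=1e-8, max_step=0.01):
    """Adam-style update on the latent real-valued weights of a BNN,
    with the per-element step clipped to a bounded range."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = np.clip(lr * m_hat / (np.sqrt(v_hat) + eps), -max_step, max_step)
    w_real = w_real - step
    w_bin = np.sign(w_real)          # binarized weights used in the forward pass
    return w_real, w_bin, (m, v, t)

# Usage: keep (m, v, t) per weight tensor across steps.
w = np.zeros(3)
state = (np.zeros(3), np.zeros(3), 0)
w, w_bin, state = clipped_adaptive_update(w, np.array([0.5, -2.0, 0.1]), state)
```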

BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model [article]

Junjie Liu, Dongchao Wen, Deyu Wang, Wei Tao, Tse-Wei Chen, Kinya Osa, Masami Kato
2020 arXiv   pre-print
In this paper, we provide an explicit convex optimization example where training BNNs with traditional adaptive optimization methods still faces the risk of non-convergence, and identify that  ...  Recent methods have significantly reduced the performance degradation of Binary Neural Networks (BNNs), but guaranteeing the effective and efficient training of BNNs is an unsolved problem.  ...  For example, the methods [52, 35] built on the idea of blended gradients exhibit a significant improvement for training BNNs.  ... 
arXiv:2009.13799v1 fatcat:oia2pd4pznellcg62rcyy3wkra

Layer-wise and Dimension-wise Locally Adaptive Federated Learning [article]

Belhal Karimi, Ping Li, Xiaoyun Li
2022 arXiv   pre-print
In this paper, we focus on the problem of training federated deep neural networks and propose a novel FL framework which further introduces layer-wise adaptivity to the local model updates.  ...  Combining (dimension-wise) adaptive gradient methods (e.g., Adam, AMSGrad) with FL has been an active direction, which has been shown to outperform traditional SGD-based FL in many cases.  ...  When training deep networks, the scale of the gradients often differs substantially across the network layers.  ... 
arXiv:2110.00532v3 fatcat:ga5nydgdhjdzpmy776k2q6rihe
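
The abstract describes adding layer-wise adaptivity to the local model updates of federated clients. The sketch below is one hedged reading of that idea, not the paper's algorithm: each client runs a few local steps whose learning rate is rescaled per layer by the inverse RMS of that layer's gradient. The scaling rule, the function names, and the delta-averaging comment are assumptions for illustration.

```python
import numpy as np

def local_layerwise_adaptive_steps(global_params, client_grad_fn,
                                   local_steps=5, lr=0.01, eps=1e-8):
    """Run local steps on one client, scaling each layer's step by the
    inverse RMS of that layer's gradient (layer-wise adaptivity)."""
    params = {k: v.copy() for k, v in global_params.items()}
    for _ in range(local_steps):
        grads = client_grad_fn(params)            # dict of per-layer gradients
        for name, g in grads.items():
            layer_rms = np.sqrt(np.mean(g * g)) + eps
            params[name] -= (lr / layer_rms) * g
    # The client would return its model delta; the server averages deltas across clients.
    return {k: params[k] - global_params[k] for k in params}

# Toy client whose gradient is simply twice its current parameters.
global_params = {"layer1": np.ones((2, 2)), "layer2": np.ones(2)}
delta = local_layerwise_adaptive_steps(global_params,
                                       lambda p: {k: 2 * v for k, v in p.items()})
```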

Disentangling Adaptive Gradient Methods from Learning Rates [article]

Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang
2020 arXiv   pre-print
and generalization of neural network training.  ...  We investigate several confounding factors in the evaluation of optimization algorithms for deep learning.  ...  Acknowledgements: We are grateful to Sanjeev Arora, Yi Zhang, Zhiyuan Li, Wei Hu, Yoram Singer, Kunal Talwar, Roger Grosse, Karthik Narasimhan, Mark Braverman, Surbhi Goel, and Sham Kakade for helpful discussions  ... 
arXiv:2002.11803v1 fatcat:m3u3sgfsznf6pa7kuf7uomvva4

Optimizing Deep Network for Image Classification with Hyper Parameter Tuning

2019 International Journal of Engineering and Advanced Technology  
The present work focuses on an empirical analysis of the performance of stochastic optimization methods with regard to hyperparameters for deep Convolutional Neural Networks (CNNs) and to understand the  ...  The deep network model comprises several processing layers, and deep learning techniques help us represent data with diverse levels of abstraction.  ...  Adam [10] is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.  ... 
doi:10.35940/ijeat.b3515.129219 fatcat:3qdxulcd5jduvinhsnlziq4kry
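
The last fragment quotes the usual one-line description of Adam [10]: first-order optimization with adaptive estimates of lower-order moments. For concreteness, the standard update it refers to, written as a compact NumPy step (hyperparameter defaults follow the common convention):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: bias-corrected first and second moment estimates."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v, t

w, m, v, t = np.zeros(3), np.zeros(3), np.zeros(3), 0
w, m, v, t = adam_step(w, np.array([0.1, -0.2, 0.3]), m, v, t)
```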

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks [article]

Jinghui Chen and Dongruo Zhou and Yiqi Tang and Ziyan Yang and Yuan Cao and Quanquan Gu
2020 arXiv   pre-print
stochastic gradient descent (SGD) with momentum in training deep neural networks.  ...  These results suggest that practitioners may pick up adaptive gradient methods once again for faster training of deep neural networks.  ...  We also thank AWS for providing cloud computing credits associated with the NSF BIGDATA award.  ... 
arXiv:1806.06763v3 fatcat:i2ly353yhnegfda43ugwtdztsa

Neighbor Combinatorial Attention for Critical Structure Mining

Tanli Zuo, Yukun Qiu, Wei-Shi Zheng
2020 Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence  
However, existing GNN methods do not explicitly extract critical structures, which reflect the intrinsic property of a graph.  ...  Graph convolutional networks (GCNs) have been widely used to process graph-structured data.  ... 
doi:10.24963/ijcai.2020/452 dblp:conf/ijcai/ChenZTYCG20 fatcat:yjvg5fobdvfnrmwhzzqmtqwcni

Training Neural Networks with Implicit Variance [chapter]

Justin Bayer, Christian Osendorfer, Sebastian Urban, Patrick van der Smagt
2013 Lecture Notes in Computer Science  
We present a novel method to train predictive Gaussian distributions p(z|x) for regression problems with neural networks.  ...  Establishing stochasticity by the injection of noise into the input and hidden units, the outputs are approximated with a Gaussian distribution by the forward propagation method introduced for fast dropout  ...  We trained the networks for  ...  Conclusion and Future Work: We presented a novel method to estimate predictive distributions via deep neural networks that plays nicely with fast dropout.  ... 
doi:10.1007/978-3-642-42042-9_17 fatcat:bjggng4xgvgtvnsumdrtoto6iy
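
The abstract describes inducing a predictive Gaussian p(z|x) by injecting noise into the input and hidden units and propagating a Gaussian approximation forward in the fast-dropout style. The sketch below approximates the same predictive mean and variance by Monte Carlo sampling of noisy forward passes instead of analytic propagation; the two-layer architecture, the tanh nonlinearity, and all names are assumptions for illustration.

```python
import numpy as np

def noisy_forward(x, W1, b1, W2, b2, noise_std=0.1, rng=None):
    """One stochastic forward pass: Gaussian noise injected at the input and hidden units."""
    rng = rng or np.random.default_rng()
    h = np.tanh((x + rng.normal(0.0, noise_std, x.shape)) @ W1 + b1)
    h = h + rng.normal(0.0, noise_std, h.shape)
    return h @ W2 + b2

def predictive_gaussian(x, weights, n_samples=100):
    """Approximate p(z|x) by a Gaussian fitted to repeated noisy forward passes."""
    outs = np.stack([noisy_forward(x, *weights) for _ in range(n_samples)])
    return outs.mean(axis=0), outs.var(axis=0)   # predictive mean and variance

rng = np.random.default_rng(0)
weights = (rng.normal(size=(3, 8)), np.zeros(8), rng.normal(size=(8, 1)), np.zeros(1))
mean, var = predictive_gaussian(np.array([0.5, -1.0, 2.0]), weights)
```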

Gradient Adapter for Hard-Threshold Deep Neural Networks

Nanxing Li, Hong Ni, Yiqiang Sheng, Zhenyu Zhao
2019 International Journal of Innovative Computing, Information and Control  
Those functions allow for the creation of large integrated systems of deep neural networks, which may have non-differentiable components and must prevent vanishing and exploding gradients for effective  ...  As neural networks grow deeper, learning approaches with hard-threshold activation functions are becoming increasingly important for reducing computational time and energy consumption.  ...  There is also research on learning for deep neural networks with hard-threshold stochastic units by estimating gradients using methods such as finite-difference approximation [29, 31] and stochastic perturbations  ... 
doi:10.24507/ijicic.15.03.1023 fatcat:ry5rmnrsrfem3k3tzzxv5zwgdy
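
The snippet notes that hard-threshold units are non-differentiable and that their gradients are typically estimated, for example by finite differences or stochastic perturbations. As a point of reference only (this is the common straight-through surrogate, not the paper's Gradient Adapter), a minimal PyTorch sketch of a hard-threshold activation that still propagates a usable gradient:

```python
import torch

class HardThresholdSTE(torch.autograd.Function):
    """Hard-threshold activation with a straight-through surrogate gradient:
    the forward pass is a step function, the backward pass lets gradients
    through only near the threshold."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()   # clipped pass-through

x = torch.randn(4, requires_grad=True)
HardThresholdSTE.apply(x).sum().backward()
print(x.grad)   # nonzero only where |x| <= 1
```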

ADAMT: A Stochastic Optimization with Trend Correction Scheme [article]

Bingxin Zhou, Xuebin Zheng, Junbin Gao
2020 arXiv   pre-print
Adam-type optimizers, as a class of adaptive moment estimation methods with the exponential moving average scheme, have been successfully used in many applications of deep learning.  ...  Such methods are appealing for their capability on large-scale sparse datasets and their high computational efficiency. In this paper, we present a new framework for adapting Adam-type methods, namely AdamT.  ...  So far, adaptive methods with exponentially moving-averaged gradients have gained great attention, with huge success in many deep learning tasks.  ... 
arXiv:2001.06130v1 fatcat:xtu7tek2abb4locgahkij66vym
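
The entry characterizes AdamT as an Adam-type method whose exponential moving averages are augmented with a trend-correction scheme. The sketch below shows one hedged interpretation of "trend correction", Holt's linear-trend smoothing of a gradient coordinate; whether this matches the paper's exact formulation is an assumption, and the smoothing constants are illustrative.

```python
def holt_trend_ema(g, level, trend, alpha=0.9, beta=0.9):
    """Holt's linear-trend smoothing: track a level and a trend of the gradient
    sequence, and forecast the next value as level + trend."""
    new_level = alpha * g + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    return new_level, new_trend, new_level + new_trend   # trend-corrected estimate

level, trend = 0.0, 0.0
for g in [1.0, 1.2, 1.4]:                 # a gradient coordinate drifting upward
    level, trend, corrected = holt_trend_ema(g, level, trend)
# 'corrected' would stand in for the plain EMA inside an Adam-style update.
```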

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees [article]

Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu
2020 arXiv   pre-print
To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers.  ...  In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme.  ...  We acknowledge Nvidia AI Technology Centre (NVAITC) for providing GPU clusters for experiments.  ... 
arXiv:1911.08727v4 fatcat:p3cpnlhmpvhnhiqrhjssryzyza
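
The abstract describes combining synchronous SGD with layer-wise adaptive gradient sparsification. The sketch below shows only the layer-wise top-k sparsification step that such a scheme communicates, with a fixed `density` standing in for the adaptive per-layer schedule; the residual-accumulation convention and all names are assumptions for illustration.

```python
import numpy as np

def layerwise_topk_sparsify(grads, density=0.01):
    """Keep only the largest-magnitude entries of each layer's gradient;
    the remainder stays in a local residual buffer for later steps."""
    sparse, residual = {}, {}
    for name, g in grads.items():
        k = max(1, int(density * g.size))
        threshold = np.partition(np.abs(g).ravel(), -k)[-k]   # k-th largest magnitude
        mask = np.abs(g) >= threshold
        sparse[name] = np.where(mask, g, 0.0)     # communicated to the other workers
        residual[name] = np.where(mask, 0.0, g)   # accumulated locally
    return sparse, residual

grads = {"conv1": np.random.randn(4, 4), "fc": np.random.randn(10)}
sparse, residual = layerwise_topk_sparsify(grads, density=0.1)
```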

ProbAct: A Probabilistic Activation Function for Deep Neural Networks [article]

Kumar Shridhar, Joonho Lee, Hideaki Hayashi, Purvanshi Mehta, Brian Kenji Iwana, Seokjun Kang, Seiichi Uchida, Sheraz Ahmed, Andreas Dengel
2020 arXiv   pre-print
The values of the mean and variance can be fixed using known functions or trained for each element.  ...  Activation functions play an important role in training artificial neural networks.  ...  For our experiments, we train µ element-wise with an initialization of µ(x) = max(0, x).  ... 
arXiv:1905.10761v2 fatcat:vugyr4pa5vgodnoolp4pxrvqwq
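
The snippet says the activation's mean and variance can be fixed or trained, with the mean initialized to max(0, x). A minimal PyTorch sketch of one such stochastic activation, assuming the simple "ReLU mean plus learnable noise scale" form; a single shared sigma is used here, whereas element-wise trainable parameters (which the abstract also mentions) are omitted.

```python
import torch
import torch.nn as nn

class ProbActLike(nn.Module):
    """Stochastic activation f(x) = relu(x) + sigma * eps with eps ~ N(0, 1).
    sigma is learnable; the mean is kept fixed at relu(x) for simplicity."""

    def __init__(self, init_sigma=0.1):
        super().__init__()
        self.sigma = nn.Parameter(torch.tensor(init_sigma))

    def forward(self, x):
        noise = torch.randn_like(x) if self.training else torch.zeros_like(x)
        return torch.relu(x) + self.sigma * noise

act = ProbActLike()
y = act(torch.randn(2, 3))   # stochastic in training mode, deterministic in eval mode
```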
Showing results 1 – 15 out of 5,168 results