Making SGD Parameter-Free [article]

Yair Carmon, Oliver Hinder
2022 arXiv   pre-print
At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.  ...  In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter  ...  Underlying our algorithm is a parameter-free certificate for SGD, which implies both localization and optimality gap bounds.  ... 
arXiv:2205.02160v1 fatcat:vipfpn2spff4fousyzmomlvpca

Parameter-Free Locally Differentially Private Stochastic Subgradient Descent [article]

Kwang-Sung Jun, Francesco Orabona
2019 arXiv   pre-print
In this work, we propose BANCO (Betting Algorithm for Noisy COins), the first ϵ-LDP SGD algorithm that essentially matches the convergence rate of the tuned SGD without any learning rate parameter, reducing  ...  While it has been shown that stochastic optimization is possible with ϵ-LDP via the standard SGD (Song et al., 2013), its convergence rate largely depends on the learning rate, which must be tuned via  ...  We would like to thank Adam Smith for his valuable feedback on differentially-private SGDs.  ... 
arXiv:1911.09564v1 fatcat:zqeek7eblver7lrk2zswqmn5he

SW-SGD: The Sliding Window Stochastic Gradient Descent Algorithm

Imen Chakroun, Tom Haber, Thomas J. Ashby
2017 Procedia Computer Science  
Mini-batch SGD with batch size n (n-SGD) is often used to control the noise on the gradient and make convergence smoother and more easy to identify, but this can reduce the learning efficiency wrt. epochs  ...  For SW-SGD, all but the newest vector in each iteration is available (for free) from the cache.  ... 
doi:10.1016/j.procs.2017.05.082 fatcat:rgwxxuhylneufefscxlcpmnqva

Distributed Hessian-Free Optimization for Deep Neural Network [article]

Xi He and Dheevatsa Mudigere and Mikhail Smelyanskiy and Martin Takáč
2017 arXiv   pre-print
With this objective, we revisit Hessian-free optimization method for deep networks.  ...  However, due to non-convexity nature of the problem, it was observed that SGD slows down near saddle point.  ...  We then have to make sure that after each iteration of SGD all weights are again synchronized.  ... 
arXiv:1606.00511v2 fatcat:7rhtl7merbgfzagu65fi3ljapi

Training Neural Networks with Stochastic Hessian-Free Optimization [article]

Ryan Kiros
2013 arXiv   pre-print
Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.  ...  Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks.  ...  Hessian-free optimization: In this section we review Hessian-free optimization, largely following the implementation of Martens [4].  ... 
arXiv:1301.3641v3 fatcat:yndfjyterneklcdlgqzekogzxy

Stochastic gradient descent with differentially private updates

Shuang Song, Kamalika Chaudhuri, Anand D. Sarwate
2013 2013 IEEE Global Conference on Signal and Information Processing  
Our results show that standard SGD experiences high variability due to differential privacy, but a moderate increase in the batch size can improve performance significantly.  ...  This can improve the robustness of the updating at a moderate expense in terms of computation, but also introduces the batch size as a free parameter.  ...  It would be interesting to see if increasing the batch size can still make private SGD match non-private SGD in these settings.  ... 
doi:10.1109/globalsip.2013.6736861 dblp:conf/globalsip/SongCS13 fatcat:6hy5t2biwzcivdyrvxbhkcm2nm

AsymptoticNG: A regularized natural gradient optimization algorithm with look-ahead strategy [article]

Zedong Tang, Fenlong Jiang, Junke Song, Maoguo Gong, Hao Li, Fan Yu, Zidong Wang, Min Wang
2021 arXiv   pre-print
An immediate idea is to complement the strengths of these algorithms with SGD.  ...  According to the total iteration step, ANG dynamic assembles NG and Euclidean gradient, and updates parameters along the new direction using the intensity of NG.  ...  Evaluated on a minibatch, SGD updates parameters in the model along the negative gradient direction with uniform scale.  ... 
arXiv:2012.13077v2 fatcat:j4bqoyglkvh5lp7q35bddzqg6e

Damped Newton Stochastic Gradient Descent Method for Neural Networks Training

Jingcheng Zhou, Wei Wei, Ruizhi Zhang, Zhiming Zheng
2021 Mathematics  
cost and makes the convergence of the learning process much faster and more accurate than SGD and Adagrad.  ...  In this paper, we explore the convexity of the Hessian matrix of partial parameters and propose the damped Newton stochastic gradient descent (DN-SGD) method and stochastic gradient descent damped Newton  ...  On the other hand, the parameters of the last layer using DN-SGD or SGD-DN for training can make the learning process converge more quickly with just a little more computing cost.  ... 
doi:10.3390/math9131533 fatcat:kqu72qme6jhm3dcehhfa2a4czu

Improved music feature learning with deep neural networks

Siddharth Sigtia, Simon Dixon
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
feature learning for audio data using neural networks: 1. using Rectified Linear Units (ReLUs) instead of standard sigmoid units; 2. using a powerful regularisation technique called Dropout; 3. using Hessian-Free  ...  SGD also takes a very large number of iterations to train sigmoid nets, making training time prohibitively large [10].  ...  In [18], Martens makes several modifications to the earlier approaches and develops a version of Hessian Free that can be applied effectively to train very deep networks.  ... 
doi:10.1109/icassp.2014.6854949 dblp:conf/icassp/SigtiaD14 fatcat:flqyqetmzjbyfjyunok3opmtrm

Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting [article]

Keyi Chen, John Langford, Francesco Orabona
2022 arXiv   pre-print
Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance.  ...  In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models.  ...  Research: TRIPODS Institute for Optimization and Learning", no. 1908111 "AF: Small: Collaborative Research: New Representations for Learning Algorithms and Secure Computation", and no. 2046096 "CAREER: Parameter-free  ... 
arXiv:2006.07507v3 fatcat:7phhzpv76zgsjbepx4cbzdlaz4

Mean-normalized stochastic gradient for large-scale deep learning

Simon Wiesler, Alexander Richard, Ralf Schluter, Hermann Ney
2014 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
In our experiments we show that our proposed algorithm converges faster than SGD.  ...  Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm.  ...  A comparison of MN-SGD with a full second-order algorithm as the Hessian-Free algorithm would be of interest too.  ... 
doi:10.1109/icassp.2014.6853582 dblp:conf/icassp/WieslerRSN14 fatcat:memakhv6bbarjdmouxjazagodu

Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks [article]

Xiaodong Cui, Wei Zhang, Zoltán Tüske, Michael Picheny
2018 arXiv   pre-print
ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between the SGD step and evolution step to improve the average  ...  In addition, individuals in the population optimized with various SGD-based optimizers using distinct hyper-parameters in the SGD step are considered as competing species in a coevolution setting such  ...  After the SGD step, the gradient-free evolution step follows.  ... 
arXiv:1810.06773v1 fatcat:4vyjnr3acfd3fosu2rije64ubu

HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent

Huan Zhang, Cho-Jui Hsieh, Venkatesh Akella
2016 2016 IEEE 16th International Conference on Data Mining (ICDM)  
Stochastic Gradient Descent (SGD) is a popular technique for solving large-scale machine learning problems. In order to parallelize SGD on multi-core machines, asynchronous SGD (HOGWILD!)  ...  In this paper we propose a novel decentralized asynchronous SGD algorithm called HOGWILD ++ that overcomes these drawbacks and shows almost linear speedup on multi-socket NUMA systems.  ...  During the past few years there have been some breakthroughs in parallelizing SGD algorithms including lock-free asynchronous update strategies [17] on multi-core machines, and the use of parameter servers  ... 
doi:10.1109/icdm.2016.0074 dblp:conf/icdm/ZhangHA16 fatcat:x54q4zlwcvhcbj6wa7fdgiefmu

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey [article]

Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, Bo Li
2020 arXiv   pre-print
This makes there is no change in the updating Eq. (6) using BSP-SGD.  ...  They clarified three major advantages of BCD: (1) higher per epoch efficiency than SGD at early stage; (2) good scalability; (3) gradient-free.  ... 
arXiv:2003.06307v1 fatcat:cdkasj4wdvavhgqlxnwj5kd2kq

On optimization methods for deep learning

Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng
2011 International Conference on Machine Learning  
These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs.  ...  The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize.  ...  In contrast, SGDs have to deal with a noisy estimate of the hidden activation and we have to set the learning rate parameters to be small to make the algorithm more stable.  ... 
dblp:conf/icml/LeNCLPN11 fatcat:s4m4aokdevd6dc5lumiuqulnvu