6,610 Hits in 3.7 sec

Making the Last Iterate of SGD Information Theoretically Optimal [article]

Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli
2019 arXiv   pre-print
While classical theoretical analysis of SGD for convex problems studies (suffix) averages of iterates and obtains information theoretically optimal bounds on suboptimality, the last point of SGD is, by  ...  The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of the last point of SGD as well as GD.  ...  It was partly answered in [1], which gave a suboptimality bound for the last point of SGD, but the obtained suboptimality rates are O(log T) worse than the information theoretically optimal rates; T is  ... 
arXiv:1904.12443v2 fatcat:qarctg6orfeiflvifandci3qge
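The last-iterate question in this entry can be illustrated with a minimal sketch: plain SGD on a one-dimensional convex problem, returning only the final iterate under a standard 1/(t+2) schedule. The toy objective, noise model, and schedule below are illustrative assumptions, not the step size sequences the paper designs.

```python
import random

def sgd_last_iterate(grad, x0, steps, lr):
    """Run SGD and return only the final iterate (no suffix averaging)."""
    x = x0
    for t in range(steps):
        x -= lr(t) * grad(x, t)
    return x

# Toy convex objective f(x) = (x - 3)^2 with noisy gradients.
random.seed(0)
def noisy_grad(x, t):
    return 2.0 * (x - 3.0) + random.gauss(0.0, 0.1)

# Standard decaying schedule; the paper's refined schedules differ.
x_last = sgd_last_iterate(noisy_grad, 0.0, 2000, lambda t: 1.0 / (t + 2))
```

With a decaying schedule the last iterate lands near the minimizer at 3; the paper's contribution is schedules for which this happens at the information theoretically optimal rate.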

Active Sampler: Light-weight Accelerator for Complex Data Analytics at Scale [article]

Jinyang Gao, H. V. Jagadish, Beng Chin Ooi
2015 arXiv   pre-print
Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration.  ...  Active Sampler is orthogonal to most approaches optimizing the efficiency of large-scale data analytics, and can be applied to most analytics models trained by the stochastic gradient descent (SGD) algorithm  ...  number of iterations since the last time x_i was used in training.  ... 
arXiv:1512.03880v1 fatcat:ezvhnggpljd5bca6aob7umbfei
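The staleness signal mentioned in the snippet (iterations since an example was last used) can be sketched as a biased batch sampler. The scoring rule below is a hypothetical simplification for illustration, not the Active Sampler's actual criterion.

```python
import random

def pick_batch(staleness, batch_size, rng):
    """Sample example indices with probability proportional to staleness,
    i.e. iterations since the example was last used in training."""
    total = sum(staleness)
    weights = [s / total for s in staleness]
    return rng.choices(range(len(staleness)), weights=weights, k=batch_size)

rng = random.Random(0)
staleness = [1, 1, 10, 1]          # example 2 has gone longest unused
batch = pick_batch(staleness, 2, rng)
```

Long-unused examples are then drawn more often, so each pass over a fraction of the data still touches the whole training set over time.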

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [article]

Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
2021 arXiv   pre-print
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN.  ...  Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam.  ...  As a result of these and other issues, one has to babysit the optimizer to make sure that training converges to an acceptable training loss, without any guarantee that a given number of iterations is enough  ... 
arXiv:2006.00719v3 fatcat:fqd7xykr3jbmne3ybwsx57xbxu
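The Hessian-diagonal estimate that ADAHESSIAN adapts can be sketched with Hutchinson's estimator, E[z ⊙ Hz] for Rademacher z. The finite-difference Hessian-vector product and toy quadratic below are assumptions for illustration, not the authors' implementation (which uses backpropagation for the Hessian-vector product).

```python
import random

def hvp(grad, x, v, eps=1e-4):
    """Hessian-vector product via central finite differences of the gradient."""
    gp = grad([xi + eps * vi for xi, vi in zip(x, v)])
    gm = grad([xi - eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]

def hutchinson_diag(grad, x, samples, rng):
    """Estimate diag(H) as the average of z * (H z) over Rademacher z."""
    d = len(x)
    est = [0.0] * d
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        hz = hvp(grad, x, z)
        est = [e + zi * hi / samples for e, zi, hi in zip(est, z, hz)]
    return est

# Quadratic f(x) = x0^2 + 3*x1^2 has Hessian diagonal (2, 6).
grad = lambda x: [2.0 * x[0], 6.0 * x[1]]
diag = hutchinson_diag(grad, [1.0, 1.0], 200, random.Random(0))
```

The estimated diagonal can then rescale per-parameter step sizes, which is the sense in which the optimizer "dynamically incorporates the curvature".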

Accelerating Minibatch Stochastic Gradient Descent using Typicality Sampling [article]

Xinyu Peng, Li Li, Fei-Yue Wang
2019 arXiv   pre-print
Although mini-batch SGD is one of the most popular stochastic optimization methods for training deep networks, it shows a slow convergence rate due to the large noise in the gradient approximation.  ...  We analyze the convergence rate of the resulting typical batch SGD algorithm and compare its convergence properties with those of mini-batch SGD.  ...  In Section V, we theoretically prove the convergence rate of the resulting typical batch SGD and compare it with conventional mini-batch SGD.  ... 
arXiv:1903.04192v1 fatcat:k72jv7n3krbchmsbmu4l6czwka
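The "noise in gradient approximation" the abstract refers to is just the gap between a sampled-batch average gradient and the full-batch one; a minimal sketch (toy gradients, not the paper's typicality-based sampler):

```python
import random

def minibatch_grad(grads, batch, x):
    """Average the per-example gradients over a batch of indices; larger
    (or better-chosen) batches reduce the noise in the approximation."""
    return sum(grads[i](x) for i in batch) / len(batch)

rng = random.Random(0)
# Four per-example quadratic losses centred at 0, 1, 2, 3.
grads = [lambda x, c=c: 2.0 * (x - c) for c in [0.0, 1.0, 2.0, 3.0]]

g = minibatch_grad(grads, rng.sample(range(4), 2), 0.0)   # noisy estimate
full = minibatch_grad(grads, range(4), 0.0)               # exact gradient
```

Typicality sampling replaces the uniform `rng.sample` with a batch chosen to be representative of the data, shrinking the gap between `g` and `full`.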

Local SGD With a Communication Overhead Depending Only on the Number of Workers [article]

Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis
2020 arXiv   pre-print
The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications.  ...  In this paper, we give a new analysis of Local SGD.  ...  Moreover, the bound in (3) is for the last iterate T , and does not require keeping track of a weighted average of all the iterates.  ... 
arXiv:2006.02582v1 fatcat:zwknwwiepvgznlljlgebm7xnmu
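The scheme analyzed here, workers taking several local steps between communications, can be sketched as follows. The quadratic worker losses and fixed step size are illustrative assumptions, not the paper's setting.

```python
def local_sgd(grads, x0, rounds, local_steps, lr):
    """Each worker runs `local_steps` gradient steps from the shared point,
    then the iterates are averaged (one communication per round)."""
    x = x0
    for _ in range(rounds):
        worker_iterates = []
        for g in grads:                  # one gradient oracle per worker
            xi = x
            for _ in range(local_steps):
                xi -= lr * g(xi)
            worker_iterates.append(xi)
        x = sum(worker_iterates) / len(worker_iterates)  # communication
    return x

# Two workers with quadratic losses centred at 1 and 3; the average
# objective is minimised at 2.
g1 = lambda x: 2.0 * (x - 1.0)
g2 = lambda x: 2.0 * (x - 3.0)
x_final = local_sgd([g1, g2], 0.0, 50, 5, 0.1)
```

Note that, consistent with the snippet, the returned value is the last iterate; no weighted average over rounds is tracked.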

Information-Theoretic Generalization Bounds for Stochastic Gradient Descent [article]

Gergely Neu, Gintare Karolina Dziugaite, Mahdi Haghifam, Daniel M. Roy
2021 arXiv   pre-print
Our key technical tool is combining the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.  ...  We study the generalization properties of the popular stochastic optimization method known as stochastic gradient descent (SGD) for optimizing general non-convex loss functions.  ...  in a previous version of the proof of the main theorem.  ... 
arXiv:2102.00931v3 fatcat:ryxp7zbud5awtg6exqn5cvpdhy

Stochastic gradient descent methods for estimation with large data sets [article]

Dustin Tran, Panos Toulis, Edoardo M. Airoldi
2015 arXiv   pre-print
Intuitively, an implicit update is a shrunken version of a standard one, where the shrinkage factor depends on the observed Fisher information at the corresponding data point.  ...  Our sgd package in R offers the most extensive and robust implementation of stochastic gradient descent methods.  ...  Despite these theoretical guarantees, explicit sgd requires careful tuning of the hyperparameter γ in the learning rate: small values of the parameter make the iteration (1) very slow to converge in practice  ... 
arXiv:1509.06459v1 fatcat:6mbwz7qi3beovn5bppoc3j4jaa
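The shrinkage intuition has a closed form for a one-dimensional squared loss: the implicit step evaluates the gradient at the new point, which solves to the explicit step scaled down by 1/(1 + γa²). This is a sketch of the idea, not code from the sgd R package.

```python
def explicit_update(x, a, y, lr):
    """Standard SGD step for the squared loss 0.5 * (a*x - y)^2."""
    return x - lr * a * (a * x - y)

def implicit_update(x, a, y, lr):
    """Implicit SGD evaluates the gradient at the *new* iterate:
    x' = x - lr * a * (a * x' - y), which solves in closed form to a
    shrunken explicit step with factor 1 / (1 + lr * a^2)."""
    return (x + lr * a * y) / (1.0 + lr * a * a)

# A few implicit steps on data consistent with x* = 2.
x = 0.0
for a, y in [(1.0, 2.0), (2.0, 4.0), (1.0, 2.0)]:
    x = implicit_update(x, a, y, lr=0.5)
```

Because of the shrinkage, implicit updates remain stable even for learning rates that would make the explicit iteration oscillate or diverge.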

Reducing Runtime by Recycling Samples [article]

Jialei Wang, Hai Wang, Nathan Srebro
2016 arXiv   pre-print
We demonstrate this empirically for SDCA, SAG and SVRG, studying the optimal sample size one should use, and also uncover behavior that suggests running SDCA for an integer number of epochs could be wasteful  ...  to reuse previously used samples instead of fresh samples, even when fresh samples are available.  ...  ., 2013) are both stochastic optimization methods with almost identical cost per iteration as SGD, but they maintain information on each of the m training points, in the form of dual variables or cached  ... 
arXiv:1602.02136v1 fatcat:ljvahxcxbfdbfbkzpc7mb2x76q

VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning

Fanhua Shang, Kaiwen Zhou, Hongying Liu, James Cheng, Ivor Tsang, Lijun Zhang, Dacheng Tao, Licheng Jiao
2018 IEEE Transactions on Knowledge and Data Engineering  
Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively.  ...  The settings allow us to use much larger learning rates, and also make our convergence analysis more challenging.  ...  ACKNOWLEDGMENTS We thank the reviewers for their valuable comments. This work was supported in part by  ... 
doi:10.1109/tkde.2018.2878765 fatcat:mlggrkwfvvbobihta7xp6cusri
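The snapshot/anchor mechanism that VR-SGD modifies comes from SVRG, which can be sketched as follows. This is the vanilla scheme on a toy problem; VR-SGD's distinguishing choice is setting the snapshot and starting points to the average and last iterate of the previous epoch.

```python
import random

def svrg(grads, x0, epochs, inner_steps, lr, rng):
    """SVRG sketch: take a full-gradient snapshot each epoch, then run
    inner steps with the variance-reduced gradient
    g_i(x) - g_i(snapshot) + mu."""
    n = len(grads)
    x = x0
    for _ in range(epochs):
        snap = x
        mu = sum(g(snap) for g in grads) / n   # full gradient at snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            x -= lr * (grads[i](x) - grads[i](snap) + mu)
    return x

# Two quadratic component losses centred at 1 and 3; minimiser of the
# average is 2.
g1 = lambda x: 2.0 * (x - 1.0)
g2 = lambda x: 2.0 * (x - 3.0)
x_opt = svrg([g1, g2], 0.0, 5, 20, 0.1, random.Random(0))
```

Because the correction term cancels the per-example noise as the iterate nears the snapshot, such methods tolerate the "much larger learning rates" the abstract mentions.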

Multi-Iteration Stochastic Optimizers [article]

Andre Carlon, Luis Espath, Rafael Lopez, Raul Tempone
2020 arXiv   pre-print
We here introduce Multi-Iteration Stochastic Optimizers, a novel class of first-order stochastic optimizers where the coefficient of variation of the mean gradient approximation, its relative statistical  ...  When compared to SGD, SAG, SAGA, SVRG, and SARAH methods, the Multi-Iteration Stochastic Optimizers reduced, without the need to tune parameters for each example, the gradient sampling cost in all cases  ...  [23], computes an estimate of the gradient at the current iteration by using control variates with respect to the last iteration.  ... 
arXiv:2011.01718v2 fatcat:r7olt4rynvh6blzoexopsi5tie
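The control-variate idea cited in the last snippet is generic Monte Carlo variance reduction: subtract a correlated quantity with known mean. A minimal sketch (generic estimator, not the paper's optimizer):

```python
import random

def cv_estimate(f, g, g_mean, xs, rng, samples):
    """Monte Carlo estimate of the mean of f over xs, using g as a
    control variate with known mean g_mean: average f(x) - g(x) + g_mean."""
    acc = 0.0
    for _ in range(samples):
        x = rng.choice(xs)
        acc += f(x) - g(x) + g_mean
    return acc / samples

xs = [1.0, 2.0, 3.0]
f = lambda x: x * x
# A perfect control variate (g == f) removes all sampling noise, so the
# estimate equals the true mean regardless of which points are drawn.
est = cv_estimate(f, f, sum(f(x) for x in xs) / 3, xs, random.Random(0), 10)
```

In gradient methods, `g` is typically the gradient at a previous iterate and `g_mean` its full-batch value, exactly the SVRG/SARAH family this entry compares against.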

Training Neural Networks with Stochastic Hessian-Free Optimization [article]

Ryan Kiros
2013 arXiv   pre-print
Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.  ...  HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients.  ...  The author would also like to thank the anonymous ICLR reviewers for their comments and suggestions.  ... 
arXiv:1301.3641v3 fatcat:yndfjyterneklcdlgqzekogzxy
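The core HF subroutine this entry mentions, conjugate gradient driven only by curvature-vector products, can be sketched on a small dense problem. Plain CG without damping or preconditioning, so a simplified stand-in for the stochastic HF inner loop.

```python
def conjugate_gradient(hvp, b, iters=20, tol=1e-10):
    """Solve H p = b using only Hessian-vector products hvp(v) = H v."""
    n = len(b)
    p = [0.0] * n
    r = list(b)                      # residual b - H p, with p = 0
    d = list(r)                      # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        hd = hvp(d)
        alpha = rs / sum(di * hi for di, hi in zip(d, hd))
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, hd)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return p

# Toy curvature H = diag(2, 6); solving H p = [2, 6] gives p = [1, 1].
hvp = lambda v: [2.0 * v[0], 6.0 * v[1]]
p = conjugate_gradient(hvp, [2.0, 6.0])
```

The point of HF is that `hvp` can be evaluated in about the same time as a gradient, so the update direction never requires forming H explicitly.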

A critical evaluation of stochastic algorithms for convex optimization

Simon Wiesler, Alexander Richard, Ralf Schlüter, Hermann Ney
2013 2013 IEEE International Conference on Acoustics, Speech and Signal Processing  
Log-linear models find a wide range of applications in pattern recognition. The training of log-linear models is a convex optimization problem.  ...  In this work, we compare the performance of stochastic and batch optimization algorithms. Stochastic algorithms are fast on large data sets but cannot be parallelized well.  ...  In contrast to L-BFGS and Rprop, SGD does not make use of any second order information. However, the advantage of SGD is that it frequently updates the model.  ... 
doi:10.1109/icassp.2013.6639010 dblp:conf/icassp/WieslerRSN13 fatcat:yuv24pgmkbhwbhexbxmtjzv6ei

SEBOOST - Boosting Stochastic Learning Using Subspace Optimization Techniques [article]

Elad Richardson, Rom Herskovitz, Boris Ginsburg, Michael Zibulevsky
2016 arXiv   pre-print
SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions.  ...  We show that the method is able to boost the performance of different algorithms, and make them more robust to changes in their hyper-parameters.  ...  Acknowledgements The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme, ERC Grant agreement no. 320649, and was supported  ... 
arXiv:1609.00629v1 fatcat:253wp47xarbxvidwuqahyqyexa

Asynchronous Complex Analytics in a Distributed Dataflow Architecture [article]

Joseph E. Gonzalez, Peter Bailis, Michael I. Jordan, Michael J. Franklin, Joseph M. Hellerstein, Ali Ghodsi, Ion Stoica
2015 arXiv   pre-print
Specifically, we investigate the use of asynchronous sideways information passing (ASIP) that presents single-stage parallel iterators with a Volcano-like intra-operator iterator that can be used for asynchronous  ...  information passing.  ...  Acknowledgments This research was supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, DARPA XData Award FA8750-12-2-0331, the NSF Graduate Research Fellowship (grant DGE-1106400  ... 
arXiv:1510.07092v1 fatcat:32qafevpyjdfffjl64fvpkwnyi

A general preconditioning scheme for difference measures in deformable registration

Darko Zikic, Maximilian Baust, Ali Kamen, Nassir Navab
2011 2011 International Conference on Computer Vision  
The major contribution of this work is a theoretical analysis which demonstrates the improvement of the condition by our approach, which is furthermore shown to be an approximation to the optimal case  ...  We present a preconditioning scheme for improving the efficiency of optimization of arbitrary difference measures in deformable registration problems.  ...  Since L-BFGS and NL-CG operate by utilizing the information about the energy gradient from subsequent iterations, this process for E_D is disturbed by the smoothing step, which makes this information inconsistent  ... 
doi:10.1109/iccv.2011.6126224 dblp:conf/iccv/ZikicBKN11 fatcat:3rvjoz3r5bam5brht4oqyj7eje
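Preconditioning as discussed in this entry amounts to rescaling the gradient so all directions converge at a similar rate; a minimal diagonal sketch on an ill-conditioned quadratic (an illustrative toy, not the registration-specific scheme of the paper):

```python
def preconditioned_step(x, grad, precond, lr):
    """Gradient step scaled componentwise by a diagonal preconditioner."""
    return [xi - lr * g / p for xi, g, p in zip(x, grad(x), precond)]

# Ill-conditioned quadratic f = x0^2 + 100*x1^2; preconditioning by the
# Hessian diagonal (2, 200) equalises progress along both axes.
grad = lambda x: [2.0 * x[0], 200.0 * x[1]]
x = [1.0, 1.0]
for _ in range(20):
    x = preconditioned_step(x, grad, [2.0, 200.0], lr=0.5)
```

Without the preconditioner, a step size safe for the stiff axis (curvature 200) would make the flat axis (curvature 2) crawl, the conditioning problem the paper's analysis addresses for difference measures.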
Showing results 1 — 15 out of 6,610 results