37 Hits in 5.8 sec

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification [article]

Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford
2018 arXiv   pre-print
In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and  ...  This work presents non-asymptotic excess risk bounds for these schemes for the stochastic approximation problem of least squares regression.  ...  Acknowledgements Sham Kakade acknowledges funding from Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and National Science Foundation (NSF) through awards CCF-1703574 and  ... 
arXiv:1610.03774v4 fatcat:7gzhgqawanbpndi4ztzipfktni
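
The excerpt above describes mini-batching as averaging many stochastic gradient samples to reduce variance, combined with iterate averaging, for least squares regression. The snippet below is a minimal illustrative sketch of that setup on synthetic data; the step size, batch size, and iteration count are arbitrary placeholders, not the tuned choices analyzed in the paper.

```python
# Minimal sketch: mini-batch SGD with running iterate averaging on a
# synthetic least-squares problem (illustrative hyperparameters only).
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 10_000, 20, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
avg = np.zeros(d)
step = 0.01
for t in range(1, 2001):
    idx = rng.integers(0, n, size=batch)                 # sample a mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch      # averaged stochastic gradient
    w -= step * grad
    avg += (w - avg) / t                                 # running average of iterates

print("last iterate error:    ", np.linalg.norm(w - w_true))
print("averaged iterate error:", np.linalg.norm(avg - w_true))
```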

The Implicit Regularization of Stochastic Gradient Flow for Least Squares [article]

Alnur Ali, Edgar Dobriban, Ryan J. Tibshirani
2020 arXiv   pre-print
We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression.  ...  We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow.  ...  M., Kidambi, R., Netrapalli, P., and Sidford, A. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification.  ... 
arXiv:2003.07802v2 fatcat:req5g6wjlncsbexknakh7kbyii
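
The excerpt above relates (stochastic) gradient flow on least squares to an explicit regularizer. As a rough illustration of the "optimization time behaves like an inverse ridge penalty" picture, the sketch below compares small-step gradient descent (an Euler discretization of gradient flow) against ridge solutions with lambda = 1/t. The exact calibration is what the paper makes precise; the 1/t mapping here is only a heuristic for illustration.

```python
# Illustrative comparison of gradient flow at time t with ridge at lambda = 1/t.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 30
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

w = np.zeros(d)
eta = 1e-3                          # small step => Euler discretization of gradient flow
for k in range(1, 5001):
    w -= eta * X.T @ (X @ w - y) / n
    if k % 1000 == 0:
        t = k * eta                 # elapsed "flow time"
        lam = 1.0 / t               # heuristic ridge level for comparison
        w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
        print(f"t={t:5.1f}  ||w_flow - w_ridge(1/t)|| = {np.linalg.norm(w - w_ridge):.3f}")
```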

Don't Use Large Mini-Batches, Use Local SGD [article]

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
2020 arXiv   pre-print
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks.  ...  Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years.  ...  ACKNOWLEDGEMENTS The authors thank the anonymous reviewers and Thijs Vogels for their precious comments and feedback.  ... 
arXiv:1808.07217v6 fatcat:7cmirv2pxrfafh24xjryn5a7bm
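
The excerpt above contrasts large-batch mini-batch SGD with local SGD, where workers take several local steps between synchronizations. Below is a minimal sketch of that communication pattern on a toy least-squares problem; the number of workers, local steps, and step size are arbitrary and only illustrative.

```python
# Minimal sketch of local SGD: K simulated workers each take H local SGD
# steps on their own data shard, then models are averaged each round.
import numpy as np

rng = np.random.default_rng(2)
n, d, K, H, batch = 4000, 10, 4, 8, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

shards = np.array_split(rng.permutation(n), K)        # one data shard per worker
w_global = np.zeros(d)
for rnd in range(50):                                 # communication rounds
    local_models = []
    for k in range(K):
        w = w_global.copy()
        for _ in range(H):                            # H local steps, no communication
            idx = rng.choice(shards[k], size=batch)
            w -= 0.05 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        local_models.append(w)
    w_global = np.mean(local_models, axis=0)          # synchronize by averaging models

print("error after local SGD:", np.linalg.norm(w_global - w_true))
```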

Limitations of the Empirical Fisher Approximation for Natural Gradient Descent [article]

Frederik Kunstner, Lukas Balles, Philipp Hennig
2020 arXiv   pre-print
Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information.  ...  We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies  ...  We thank Emtiyaz Khan, Aaron Mishkin, and Didrik Nielsen for many insightful conversations that led to this work, and the anonymous reviewers for their constructive feedback.  ... 
arXiv:1905.12558v3 fatcat:azrcmvulefbz5ojnxuc2hwdidi
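
The excerpt above defines natural gradient descent as preconditioning the gradient with the Fisher information matrix, and the paper's concern is the common substitution of the "empirical Fisher" (outer products of observed per-example gradients). The toy sketch below computes both matrices for logistic regression and compares the resulting update directions; it is not the paper's experiment, and the damping value is an arbitrary placeholder.

```python
# Toy contrast of Fisher vs. empirical Fisher preconditioning for logistic regression.
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_star))).astype(float)

w = rng.normal(size=d)                                # current iterate
p = 1 / (1 + np.exp(-X @ w))
grad = X.T @ (p - y) / n                              # gradient of average NLL

F = (X * (p * (1 - p))[:, None]).T @ X / n            # Fisher: (1/n) sum p(1-p) x x^T
G = X * (p - y)[:, None]
F_emp = G.T @ G / n                                   # empirical Fisher: (1/n) sum g g^T

damp = 1e-3 * np.eye(d)
dir_ngd = np.linalg.solve(F + damp, grad)             # natural-gradient direction
dir_efd = np.linalg.solve(F_emp + damp, grad)         # empirical-Fisher direction
cosine = dir_ngd @ dir_efd / (np.linalg.norm(dir_ngd) * np.linalg.norm(dir_efd))
print("angle between directions (deg):", np.degrees(np.arccos(np.clip(cosine, -1, 1))))
```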

Obtaining Adjustable Regularization for Free via Iterate Averaging [article]

Jingfeng Wu, Vladimir Braverman, Lin F. Yang
2020 arXiv   pre-print
Very recently, Neu and Rosasco show that if we run stochastic gradient descent (SGD) on linear regression problems, then by averaging the SGD iterates properly, we obtain a regularized solution.  ...  In sum, we obtain adjustable regularization for free for a large class of optimization problems and resolve an open question raised by Neu and Rosasco.  ...  Acknowledgement This research is supported in part by NSF CAREER grant 1652257, ONR Award N00014-18-1-2364 and the Lifelong Learning Machines program from DARPA/MTO.  ... 
arXiv:2008.06736v1 fatcat:hh6y34dbgfao5b5oa7d27sxbu4
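
The excerpt above concerns obtaining regularization "for free" by averaging SGD iterates, following Neu and Rosasco. The sketch below only illustrates the qualitative shrinkage effect of uniform iterate averaging on least squares; the specific weighting schemes that realize a prescribed regularization level are the papers' contribution and are not reproduced here.

```python
# Rough illustration: averaging SGD iterates yields a solution shrunk
# relative to the last iterate and to OLS (uniform weights, toy data).
import numpy as np

rng = np.random.default_rng(4)
n, d = 300, 40
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

w = np.zeros(d)
avg = np.zeros(d)
for t in range(1, 3001):
    i = rng.integers(n)
    w -= 0.005 * (X[i] @ w - y[i]) * X[i]     # single-sample SGD step
    avg += (w - avg) / t                       # uniform average of iterates

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("||last iterate||:    ", np.linalg.norm(w))
print("||averaged iterate||:", np.linalg.norm(avg))
print("||OLS solution||:    ", np.linalg.norm(w_ols))
```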

Scalable estimation strategies based on stochastic approximations: classical results and new insights

Panos Toulis, Edoardo M. Airoldi
2015 Statistics and Computing
methods, such as implicit stochastic gradient descent.  ...  Estimation with large amounts of data can be facilitated by stochastic gradient methods, in which model parameters are updated sequentially using small batches of data at each step.  ...  Acknowledgments The authors wish to thank Leon Bottou, Bob Carpenter, David Dunson, Andrew Gelman, Brian Kulis, Xiao-Li Meng, Natesh Pillai, Neil Shephard, Daniel Sussman and Alexander Volfovsky for useful  ... 
doi:10.1007/s11222-015-9560-y pmid:26139959 pmcid:PMC4484776 fatcat:2qar4lt65bbhldbk5ot2sanmge
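
The excerpt above mentions implicit stochastic gradient descent, where each update is defined implicitly through the new iterate. For linear regression with squared loss the implicit update has a simple closed form, sketched below on synthetic streaming data; the learning-rate schedule is an arbitrary choice for illustration.

```python
# Sketch of implicit SGD for linear regression: solve
# theta_n = theta_{n-1} - gamma * grad(theta_n), which for squared loss
# reduces to shrinking the residual by 1 / (1 + gamma * ||x||^2).
import numpy as np

rng = np.random.default_rng(5)
n, d = 5000, 15
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
for t in range(n):
    x, target = X[t], y[t]
    gamma = 1.0 / (1 + t / 10)                       # decaying step size
    resid = x @ theta - target
    theta -= gamma * resid / (1 + gamma * (x @ x)) * x   # implicit update (closed form)

print("implicit SGD error:", np.linalg.norm(theta - w_true))
```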

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period [article]

Shuheng Shen, Yifei Cheng, Jingchang Liu, Linli Xu
2020 arXiv   pre-print
Distributed parallel stochastic gradient descent algorithms are workhorses for large scale machine learning tasks.  ...  We prove that STL-SGD can keep the same convergence rate and linear speedup as mini-batch SGD.  ...  Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification, 2016.  ... 
arXiv:2006.06377v2 fatcat:ftz2loc74rcv3hsph6jcinjrfe

Smart "Predict, then Optimize" [article]

Adam N. Elmachtoub, Paul Grigas
2020 arXiv   pre-print
for designing better prediction models.  ...  By and large, machine learning tools are intended to minimize prediction error and do not account for how the predictions will be used in the downstream optimization problem.  ...  Acknowledgements The authors gratefully acknowledge the support of NSF Awards CMMI-1763000, CCF-1755705, and CMMI-1762744.  ... 
arXiv:1710.08005v5 fatcat:a3fbloeyznaovhasvswexbzncq
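
The excerpt above argues that minimizing prediction error can ignore how predictions are used in a downstream optimization. The toy example below, with made-up numbers, shows that a prediction with smaller MSE can still induce a worse decision than a prediction with larger MSE; it is only meant to illustrate the motivation, not the SPO framework itself.

```python
# Toy "predict, then optimize" illustration: smaller prediction error
# does not guarantee a better downstream decision.
import numpy as np

c_true = np.array([1.0, 1.2])       # true costs of two alternatives
pred_a = np.array([1.30, 1.25])     # smaller MSE, but wrong ordering
pred_b = np.array([0.50, 1.90])     # larger MSE, but right ordering

for name, pred in [("A", pred_a), ("B", pred_b)]:
    mse = np.mean((pred - c_true) ** 2)
    choice = int(np.argmin(pred))                 # decision: pick the cheaper option
    print(f"prediction {name}: MSE={mse:.3f}, realized cost={c_true[choice]:.2f}")
```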

Understanding and Scheduling Weight Decay [article]

Zeke Xie, Issei Sato, Masashi Sugiyama
2021 arXiv   pre-print
Second, we propose a novel weight-decay linear scaling rule for large-batch training that proportionally increases weight decay rather than the learning rate as the batch size increases.  ...  Weight decay is a popular and even necessary regularization technique for training deep neural networks that generalize well.  ...  ACKNOWLEDGEMENT MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.  ... 
arXiv:2011.11152v4 fatcat:gbuwvxetvnbb5cpa6hfpwsh34u
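
The excerpt above describes a weight-decay linear scaling rule for large-batch training: increase weight decay in proportion to the batch size instead of scaling the learning rate. The snippet below is a literal sketch of that rule as stated in the excerpt; the base batch size, learning rate, and weight decay are arbitrary placeholders.

```python
# Sketch of the weight-decay linear scaling rule described in the excerpt:
# scale weight decay by k when the batch size grows by a factor k, keeping
# the learning rate fixed. Base values are placeholders.
base_batch, base_wd, base_lr = 128, 5e-4, 0.1

def scaled_hyperparams(batch_size):
    """Return (learning rate, weight decay) for a new batch size."""
    k = batch_size / base_batch
    return base_lr, base_wd * k          # lr unchanged, weight decay scaled by k

for bs in (128, 512, 2048):
    lr, wd = scaled_hyperparams(bs)
    print(f"batch={bs:5d}  lr={lr:.3f}  weight_decay={wd:.2e}")
```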

Uncertainty Quantification for Online Learning and Stochastic Approximation via Hierarchical Incremental Gradient Descent [article]

Weijie J. Su, Yuancheng Zhu
2018 arXiv   pre-print
Stochastic gradient descent (SGD) is an immensely popular approach for online learning in settings where data arrives in a stream or data sizes are very large.  ...  The HiGrad procedure begins by performing SGD updates for a while and then splits the single thread into several threads, and this procedure hierarchically operates in this fashion along each thread.  ...  Batch size. Mini-batch gradient descent is a trade-off between SGD and gradient descent.  ... 
arXiv:1802.04876v2 fatcat:w4wrx2ubgff57b6oq7z2fz6pse
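
The excerpt above describes the HiGrad idea of running SGD for a while and then splitting into several threads to quantify uncertainty. Below is a heavily simplified, non-hierarchical sketch of that splitting pattern: a burn-in thread, independent continuation threads, and the spread across threads as a rough dispersion estimate. The real HiGrad procedure nests splits hierarchically and constructs proper t-based confidence intervals, none of which is reproduced here.

```python
# Simplified split-thread sketch (not the actual HiGrad CI construction).
import numpy as np

rng = np.random.default_rng(6)
d, n_burn, n_thread, K = 10, 2000, 2000, 4
w_true = rng.normal(size=d)

def sgd(w, steps, seed):
    """Streaming least-squares SGD with a decaying step size."""
    r = np.random.default_rng(seed)
    for t in range(1, steps + 1):
        x = r.normal(size=d)
        y = x @ w_true + 0.5 * r.normal()
        w = w - (0.5 / (10 + t)) * (x @ w - y) * x
    return w

w0 = sgd(np.zeros(d), n_burn, seed=100)                       # single burn-in thread
threads = np.array([sgd(w0.copy(), n_thread, seed=200 + k) for k in range(K)])
est = threads.mean(axis=0)                                    # pooled estimate
spread = threads.std(axis=0)                                  # rough dispersion across threads
print("estimate (first coord):", est[0], "+/-", spread[0])
```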

Market Segmentation Trees [article]

Ali Aouad, Adam N. Elmachtoub, Kris J. Ferreira, Ryan McNellis
2020 arXiv   pre-print
We provide a customizable, open-source code base for training MSTs in Python which employs several strategies for scalability, including parallel processing and warm starts.  ...  (ii) Isotonic Regression Trees (IRTs) which can be used to solve the bid landscape forecasting problem.  ...  Acknowledgments Elmachtoub and McNellis were partially supported by NSF grant CMMI-1763000.  ... 
arXiv:1906.01174v2 fatcat:5qfzl6wqhzhyfmtas6wqa36uma

Direct loss minimization algorithms for sparse Gaussian processes [article]

Yadi Wei, Rishit Sheth, Roni Khardon
2020 arXiv   pre-print
For the conjugate case, we consider DLM for log-loss and DLM for square loss showing a significant performance improvement in both cases.  ...  Second, a theoretical analysis of biased Monte Carlo estimates (bMC) shows that stochastic gradient descent converges despite the biased gradients. Experiments demonstrate empirical success of DLM.  ...  ., through its support for the Indiana University Pervasive Technology Institute.  ... 
arXiv:2004.03083v3 fatcat:hegv2xk4pjfphfwtlzlnw3lamy

Towards Understanding Generalization via Decomposing Excess Risk Dynamics [article]

Jiaye Teng, Jianhao Ma, Yang Yuan
2021 arXiv   pre-print
The decomposition framework performs well in both linear regimes (overparameterized linear regression) and non-linear regimes (diagonal matrix recovery).  ...  Concretely, we decompose the excess risk dynamics and apply stability-based bound only on the noise component.  ...  Acknowledgement The authors would like to thank Ruiqi Gao for his insightful suggestions. We also thank Tianle Cai, Haowei He, Kaixuan Huang, and, Jingzhao Zhang for their helpful discussions.  ... 
arXiv:2106.06153v2 fatcat:vtexiukberg6pch62mrkvilzme

Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [article]

Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, Tianbao Yang
2020 arXiv   pre-print
Compared with the naive parallel version of an existing algorithm that computes stochastic gradients at individual machines and averages them for updating the model parameters, our algorithm requires a  ...  In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as a predictive model.  ...  Acknowledgements This work is partially supported by National Science Foundation CAREER Award 1844403 and National Science Foundation Award 1933212.  ... 
arXiv:2005.02426v2 fatcat:7maye6r27jhfnptts5f3yskcdu

The Effect of Network Width on the Performance of Large-batch Training [article]

Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris
2018 arXiv   pre-print
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training.  ...  Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD.  ...  Acknowledgement This work was supported in part by a gift from Google and AWS Cloud Credits for Research from Amazon. We thank Jeffrey Naughton for invaluable discussions.  ... 
arXiv:1806.03791v1 fatcat:7zmyho7ovvbddnbu2ghwe4esfu
Showing results 1 — 15 out of 37 results