26 Hits in 5.6 sec

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression [article]

Jingfeng Wu and Difan Zou and Vladimir Braverman and Quanquan Gu and Sham M. Kakade
2021 arXiv   pre-print
In this paper, we provide a problem-dependent analysis of the last iterate risk bounds of SGD with decaying stepsize for (overparameterized) linear regression problems.  ...  However, a sharp analysis for the last iterate of SGD with decaying stepsize in the overparameterized setting is still open.  ...  In this work, we provide a problem-dependent excess risk bound for the last iterate of SGD with decaying stepsize for linear regression.  ...
arXiv:2110.06198v1 fatcat:zo2p5ql4zvaehmlcfzv3tort6m
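
The snippet above describes last-iterate SGD with a decaying stepsize for overparameterized least squares. Below is a minimal sketch of that setup in Python; the polynomial decay schedule, the conservative choice of eta0, and the toy dimensions are illustrative assumptions, not the paper's exact analysis.

```python
import numpy as np

def sgd_last_iterate_decaying(X, y, eta0, decay=0.5):
    """One pass of SGD for least squares, returning the last iterate
    (no averaging) under a polynomially decaying stepsize
    eta_t = eta0 / (t + 1)**decay (an illustrative schedule)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n):
        grad = (X[t] @ w - y[t]) * X[t]        # gradient of 0.5*(x_t^T w - y_t)^2
        w -= eta0 / (t + 1) ** decay * grad
    return w                                    # last iterate

# toy overparameterized instance (d > n) with identity feature covariance
rng = np.random.default_rng(0)
n, d = 100, 500
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)
w_hat = sgd_last_iterate_decaying(X, y, eta0=1.0 / (2 * d))   # eta0 kept well below 1/trace(H)
# with identity covariance, the excess risk equals the squared parameter error
print("excess risk:", np.sum((w_hat - w_star) ** 2))
```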

Benign Overfitting of Constant-Stepsize SGD for Linear Regression [article]

Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and Sham M. Kakade
2021 arXiv   pre-print
This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging or tail averaging) for linear regression in the overparameterized regime.  ...  For SGD with tail averaging, we show its advantage over SGD with iterate averaging by proving a better excess risk bound together with a nearly matching lower bound.  ...  Benign overfitting, i.e., a predictor that fits the training data very well but still generalizes, happens for SGD (with constant stepsize and iterate averaging) even for simple, overparameterized linear regression.  ...
arXiv:2103.12692v3 fatcat:qdm5gszakjhi7btc6a5wixwm7m
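
A sketch of the two averaging schemes the entry compares, under a constant stepsize; the choice of averaging the last half of the iterates and leaving eta as a user-chosen argument are assumptions for illustration.

```python
import numpy as np

def constant_step_sgd(X, y, eta, tail_frac=0.5):
    """One pass of constant-stepsize SGD for least squares.

    Returns (iterate_average, tail_average): the average of all iterates
    and the average of only the last `tail_frac` fraction of them."""
    n, d = X.shape
    w = np.zeros(d)
    iterates = np.empty((n, d))
    for t in range(n):
        w -= eta * (X[t] @ w - y[t]) * X[t]
        iterates[t] = w
    tail_start = int((1.0 - tail_frac) * n)
    return iterates.mean(axis=0), iterates[tail_start:].mean(axis=0)
```

Roughly speaking, tail averaging discards the early iterates that are still dominated by the initialization, which is the usual intuition for why it can enjoy a better excess risk bound than averaging every iterate.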

The Benefits of Implicit Regularization from SGD in Least Squares Problems [article]

Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and Dean P. Foster and Sham M. Kakade
2021 arXiv   pre-print
In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.  ...  We first recall the excess risk bounds for SGD (with tail averaging) and ridge regression as follows.  ...
arXiv:2108.04552v1 fatcat:m6q27ka4ezez5iik2qvrrphmzm
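
A minimal comparison in the spirit of the entry above: the closed-form ridge solution (explicit regularization) next to a single pass of unregularized SGD with tail averaging (implicit regularization). The regularization strength, stepsize, and problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 50
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

# ridge regression: explicit l2 regularization, closed form
lam = 1.0                                    # illustrative regularization strength
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# unregularized single-pass SGD with tail averaging: implicit regularization
eta, w, tail = 0.01, np.zeros(d), []
for t in range(n):
    w -= eta * (X[t] @ w - y[t]) * X[t]
    if t >= n // 2:                          # average only the second half of the iterates
        tail.append(w.copy())
w_sgd = np.mean(tail, axis=0)

# with identity feature covariance the excess risk is the squared parameter error
print("ridge:", np.sum((w_ridge - w_star) ** 2))
print("tail-averaged SGD:", np.sum((w_sgd - w_star) ** 2))
```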

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [article]

Difan Zou and Jingfeng Wu and Vladimir Braverman and Quanquan Gu and Sham M. Kakade
2022 arXiv   pre-print
The goal of this paper is to sharply characterize the generalization of multi-pass SGD by developing an instance-dependent excess risk bound for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance.  ...  (fixed stepsize, last iterate) can only cooperate with a small stepsize.  ...
arXiv:2203.03159v1 fatcat:gurtdwhemfgwfndyyxz46snrtq
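
A sketch of multi-pass SGD in the interpolation regime (noiseless labels, more parameters than samples), exposing the three quantities the bound is stated in terms of: the number of iterations, the stepsize, and the data covariance implicit in X. All constants are illustrative.

```python
import numpy as np

def multipass_sgd(X, y, stepsize, n_passes, seed=0):
    """Multi-pass SGD for least squares: sweeps over the data repeatedly,
    sampling one example per step with replacement."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes * n):
        i = rng.integers(n)
        w -= stepsize * (X[i] @ w - y[i]) * X[i]
    return w

# interpolation regime: noiseless labels and d > n, so a perfect fit exists
rng = np.random.default_rng(2)
n, d = 50, 200
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star
w_hat = multipass_sgd(X, y, stepsize=1.0 / (2 * d), n_passes=100)
print("train loss:", np.mean((X @ w_hat - y) ** 2))
print("excess risk (identity covariance):", np.sum((w_hat - w_star) ** 2))
```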

Dimension Independent Generalization Error by Stochastic Gradient Descent [article]

Xi Chen and Qiang Liu and Xin T. Tong
2021 arXiv   pre-print
In particular, we present a general theory on the generalization error of stochastic gradient descent (SGD) solutions for both convex and locally convex loss functions.  ...  The studied statistical applications include both convex models, such as linear regression and logistic regression, and non-convex models, such as M-estimators and two-layer neural networks.  ...  By running through N samples, SGD outputs the N-th iterate w_N as the final estimator of w*. In the SGD iterations (4), the hyperparameter η is known as the stepsize.  ...
arXiv:2003.11196v2 fatcat:mk4tedq2grcztex65boaqv67vq
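
The snippet spells out the streaming estimator studied: run SGD once through the N samples and keep the N-th iterate w_N. Here is a minimal sketch for logistic regression, one of the convex models listed; the constant stepsize of 0.1 and the toy data are illustrative assumptions.

```python
import numpy as np

def sgd_logistic_last_iterate(X, y, eta=0.1):
    """Single pass of SGD for logistic regression with labels y in {-1, +1}.
    Each of the N samples is processed once; the N-th iterate w_N is
    returned as the final estimator of w*. eta is the stepsize."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n):
        margin = y[t] * (X[t] @ w)
        grad = -y[t] * X[t] / (1.0 + np.exp(margin))   # gradient of log(1 + exp(-y x^T w))
        w -= eta * grad
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=1000))
print(sgd_logistic_last_iterate(X, y))
```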

Towards Understanding Generalization via Decomposing Excess Risk Dynamics [article]

Jiaye Teng, Jianhao Ma, Yang Yuan
2022 arXiv   pre-print
The decomposition framework performs well in both linear regimes (overparameterized linear regression) and non-linear regimes (diagonal matrix recovery).  ...  Concretely, we decompose the excess risk dynamics and apply the stability-based bound only to the noise component.  ...  The authors would like to thank Ruiqi Gao for his insightful suggestions. We also thank Tianle Cai, Haowei He, Kaixuan Huang, and Jingzhao Zhang for their helpful discussions.  ...
arXiv:2106.06153v3 fatcat:r5cfnnojnvhwllolfls7u5haxe

Relaxing the Feature Covariance Assumption: Time-Variant Bounds for Benign Overfitting in Linear Regression [article]

Jing Xu, Jiaye Teng, Andrew Chi-Chih Yao
2022 arXiv   pre-print
By introducing the time factor, we relax the strict assumption on the feature covariance matrix required in previous benign overfitting analyses under the regime of overparameterized linear regression with gradient descent.  ...  This paper extends the scope of benign overfitting, and experimental results indicate that the proposed bound accords better with empirical evidence.  ...  We summarize our contributions as follows: • We derive a time-variant excess risk bound for overparameterized linear regression with gradient descent and provide a time interval in which the excess risk  ...
arXiv:2202.06054v1 fatcat:3rbhccjgifc3npc2255ssvqgaa
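
To make the "time-variant" reading concrete, here is a sketch that runs full-batch gradient descent on an overparameterized least-squares instance and records the excess risk at every iteration; the time interval over which the recorded risk stays low is the kind of object such a bound describes. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 400
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.5 * rng.normal(size=n)

eta, T = 0.05, 1000                      # illustrative stepsize and horizon
w = np.zeros(d)
risks = []
for t in range(T):
    w -= eta * X.T @ (X @ w - y) / n               # full-batch gradient step
    risks.append(np.sum((w - w_star) ** 2))        # excess risk under identity covariance

best_t = int(np.argmin(risks))
print("excess risk is smallest around iteration", best_t)
```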

Optimization for deep learning: theory and algorithms [article]

Ruoyu Sun
2019 arXiv   pre-print
This article provides an overview of optimization algorithms and theory for training neural networks.  ...  Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods, and distributed methods, and theoretical results for these algorithms.  ...  We also thank Ju Sun for the list of related works on the webpage [101], which helped the writing of this article.  ...
arXiv:1912.08957v1 fatcat:bdtx2o3qhfhthh2vyohkuwnxxa

On Seven Fundamental Optimization Challenges in Machine Learning [article]

Konstantin Mishchenko
2021 arXiv   pre-print
The fifth challenge is the development of an algorithm for distributed optimization with quantized updates that preserves linear convergence of gradient descent.  ...  In particular, we present the first parameter-free stepsize rule for gradient descent that provably works for any locally smooth convex objective.  ...  I am grateful to my coauthors for making this thesis happen. I cannot imagine myself achieving the same research progress without the help of the people that I worked with.  ... 
arXiv:2110.12281v1 fatcat:c4oc7xv6fvdqdik4hwegrcnsqm

Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems [article]

Preetum Nakkiran
2020 arXiv   pre-print
In this note, we show that this phenomenon can exist even for convex learning problems -- in particular, linear regression in 2 dimensions.  ...  In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes and early stopping.  ...  Work supported in part by the Simons Investigator Awards of Boaz Barak and Madhu Sudan, and NSF Awards under grants CCF 1565264, CCF 1715187, and CNS 1618026.  ...
arXiv:2005.07360v1 fatcat:plkqqrko6rerjerfavquoqkfpu
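
A toy 2-dimensional sketch of the comparison the note makes: SGD with a large-then-small stepsize schedule versus a small constant stepsize, with the train and test feature covariances deliberately mismatched. This is not the paper's construction, and whether annealing wins on a given random draw depends on the instance; the code only sets up the comparison.

```python
import numpy as np

def run_sgd(X, y, schedule, T=500, seed=0):
    """SGD on 2-D least squares with a user-supplied stepsize schedule."""
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    for t in range(T):
        i = rng.integers(len(y))
        w -= schedule(t) * (X[i] @ w - y[i]) * X[i]
    return w

# train and test features drawn with different covariances, so the train and
# test loss landscapes disagree (one ingredient the note points to)
rng = np.random.default_rng(5)
w_star = np.array([1.0, -1.0])
cov_test = np.diag([0.01, 1.0])
X_tr = rng.multivariate_normal(np.zeros(2), np.diag([1.0, 0.01]), size=200)
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=200)

annealed = lambda t: 0.5 if t < 250 else 0.01   # large stepsize, then small
constant = lambda t: 0.01                        # small constant stepsize

for name, sched in [("annealed", annealed), ("constant", constant)]:
    w = run_sgd(X_tr, y_tr, sched)
    print(name, "test excess risk:", (w - w_star) @ cov_test @ (w - w_star))
```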

Recent Theoretical Advances in Non-Convex Optimization [article]

Marina Danilova, Pavel Dvurechensky, Alexander Gasnikov, Eduard Gorbunov, Sergey Guminov, Dmitry Kamzolov, Innokentiy Shibaev
2021 arXiv   pre-print
overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization.  ...  For this setting, we first present known results for the convergence rates of deterministic first-order methods, which are then followed by a general theoretical analysis of optimal stochastic and randomized  ...  Scheinberg for fruitful discussions and their suggestions which helped to improve the quality of the text.  ... 
arXiv:2012.06188v3 fatcat:6cwwns3pnba5zbodlhddof6xai

Implicit Regularization and Convergence for Weight Normalization [article]

Xiaoxia Wu and Edgar Dobriban and Tongzheng Ren and Shanshan Wu and Zhiyuan Li and Suriya Gunasekar and Rachel Ward and Qiang Liu
2020 arXiv   pre-print
For certain stepsizes of g and w, we show that they can converge close to the minimum norm solution.  ...  Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression.  ...  XW, SW, ED, SG, and RW thank the Simons Institute for their hospitality during the Summer 2019 program on the Foundations of Deep Learning.  ...
arXiv:1911.07956v4 fatcat:aatlegsrkvaw3felgfw2dkeafe
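
A sketch of gradient descent on the weight-normalized parametrization w = g * v / ||v|| for least squares, with separate stepsizes for the scale and the direction. This is a generic WN-style update for illustration, not the paper's rPGD variant; the stepsizes, horizon, and initialization are assumptions.

```python
import numpy as np

def wn_gradient_descent(X, y, eta_g=0.05, eta_v=0.05, T=2000):
    """Gradient descent on w = g * v / ||v|| for the loss 0.5/n * ||X w - y||^2,
    with separate stepsizes for the scale g and the direction v."""
    n, d = X.shape
    g, v = 1.0, np.ones(d) / np.sqrt(d)
    for _ in range(T):
        norm_v = np.linalg.norm(v)
        w = g * v / norm_v
        grad_w = X.T @ (X @ w - y) / n
        grad_g = (v / norm_v) @ grad_w
        # chain rule through v -> v/||v||: remove the radial component
        grad_v = (g / norm_v) * (grad_w - (v @ grad_w) / norm_v**2 * v)
        g -= eta_g * grad_g
        v -= eta_v * grad_v
    return g * v / np.linalg.norm(v)

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 100))                 # overparametrized: d > n
y = X @ (rng.normal(size=100) / 10)
w_wn = wn_gradient_descent(X, y)
print("train loss:", 0.5 * np.mean((X @ w_wn - y) ** 2))
```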

An Improved Analysis of Stochastic Gradient Descent with Momentum [article]

Yanli Liu, Yuan Gao, Wotao Yin
2020 arXiv   pre-print
SGD with momentum (SGDM) has been widely applied in many machine learning tasks, and it is often applied with dynamic stepsizes and momentum weights tuned in a stagewise manner.  ...  Despite its empirical advantage over SGD, the role of momentum is still unclear in general, since previous analyses of SGDM either provide worse convergence bounds than those of SGD or assume Lipschitz  ...  As a result, (7) stipulates that fewer iterations are required for stages with large stepsizes and more iterations for stages with small stepsizes.  ...
arXiv:2007.07989v2 fatcat:43rrsoimezhjlegxil3q7mkvme
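
A sketch of the stagewise setup described above: heavy-ball SGD with momentum where the stepsize is fixed within a stage and stages with larger stepsizes run for fewer iterations. The stage lengths, stepsizes, and the constant momentum weight are illustrative.

```python
import numpy as np

def stagewise_sgdm(grad_fn, w0, stages, beta=0.9, seed=0):
    """Heavy-ball SGD with momentum, run in stages.

    `stages` is a list of (stepsize, num_iterations) pairs; larger stepsizes
    get fewer iterations, matching the snippet above. `grad_fn(w, rng)`
    returns a stochastic gradient at w, and beta is the momentum weight."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    rng = np.random.default_rng(seed)
    for eta, iters in stages:
        for _ in range(iters):
            m = beta * m + grad_fn(w, rng)   # momentum buffer
            w -= eta * m
    return w

# example: noisy gradients of the quadratic 0.5 * ||w||^2
grad_fn = lambda w, rng: w + 0.1 * rng.normal(size=w.shape)
stages = [(0.1, 100), (0.01, 1000), (0.001, 10000)]   # decaying stepsizes, growing stage lengths
print(stagewise_sgdm(grad_fn, np.ones(5), stages))
```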

Shape Matters: Understanding the Implicit Bias of the Noise Covariance [article]

Jeff Z. HaoChen, Colin Wei, Jason D. Lee, Tengyu Ma
2020 arXiv   pre-print
The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models.  ...  We show that in an overparameterized setting, SGD with label noise recovers the sparse ground truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense  ...  JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF 2002272. TM acknowledges support of a Google Faculty Award.  ...
arXiv:2006.08680v2 fatcat:g2jg27ryybbpzi2kxh5p7cy3ny
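
A sketch of the two noise-injection schemes the entry contrasts, written for plain least squares: label-noise SGD perturbs the target at every step, so the injected noise is shaped by the sampled example (its covariance is proportional to x x^T), whereas Gaussian parameter noise is isotropic and ignores the data geometry. The paper's analysis concerns a particular over-parameterized sparse model; this is only a generic illustration, and all defaults are assumptions.

```python
import numpy as np

def sgd_label_noise(X, y, eta=0.01, sigma=0.5, T=5000, seed=0):
    """SGD where each step uses a freshly perturbed label y_i + noise, so the
    injected noise enters through the loss and is shaped by the sampled x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)
        y_noisy = y[i] + sigma * rng.normal()          # fresh label noise each step
        w -= eta * (X[i] @ w - y_noisy) * X[i]
    return w

def sgd_gaussian_noise(X, y, eta=0.01, sigma=0.5, T=5000, seed=0):
    """Same update, but isotropic Gaussian noise is added directly to the
    stochastic gradient, so the noise covariance ignores the data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)
        w -= eta * ((X[i] @ w - y[i]) * X[i] + sigma * rng.normal(size=d))
    return w
```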

IntSGD: Adaptive Floatless Compression of Stochastic Gradients [article]

Konstantin Mishchenko and Bokun Wang and Dmitry Kovalev and Peter Richtárik
2022 arXiv   pre-print
Our theory shows that the iteration complexity of IntSGD matches that of SGD up to constant factors for both convex and non-convex, smooth and non-smooth functions, with and without overparameterization  ...  We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float.  ...  We tune the initial stepsize in {0.0001, 0.001, 0.01, 0.1, 1} with SGD; the stepsize is divided by 10 at epochs 120 and 160, a schedule shared by all algorithms.  ...
arXiv:2102.08374v2 fatcat:4fj5jjriirc5vehj5t3grwit5a
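
To make the "no floats communicated" idea concrete, here is a generic integer compression sketch: scale the gradient, apply unbiased stochastic rounding to integers, and transmit only the integer vector (plus a shared scale). The scale rule below is a fixed placeholder, not the adaptive operator proposed in the paper.

```python
import numpy as np

def int_compress(g, scale, rng):
    """Divide by `scale`, then round stochastically so that the integer
    output is an unbiased estimate of g / scale."""
    z = g / scale
    low = np.floor(z)
    return (low + (rng.random(z.shape) < (z - low))).astype(np.int64)

def int_decompress(ints, scale):
    return ints.astype(np.float64) * scale

rng = np.random.default_rng(7)
g = rng.normal(size=10)                          # a gradient to be communicated
scale = 0.01 * np.abs(g).max()                   # placeholder scale, not the paper's adaptive rule
q = int_compress(g, scale, rng)
print("integers sent:", q)
print("max reconstruction error:", np.max(np.abs(int_decompress(q, scale) - g)))
```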
Showing results 1–15 of 26.