332 Hits in 4.1 sec

Training (Overparametrized) Neural Networks in Near-Linear Time [article]

Jan van den Brand, Binghui Peng, Zhao Song, Omri Weinstein
2020 arXiv   pre-print
We show how to speed up the algorithm of [CGH+19], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of  ...  Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an O(mn^2)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial  ...  for linear training time.  ... 
arXiv:2006.11648v2 fatcat:b7dmwuurivetnafmbznqtghnum

Convergence of Adversarial Training in Overparametrized Neural Networks [article]

Ruiqi Gao, Tianle Cai, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, Jason D. Lee
2019 arXiv   pre-print
In addition, we also prove that robust interpolation requires more model capacity, supporting the evidence that adversarial training requires wider networks.  ...  Neural networks are vulnerable to adversarial examples, i.e. inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network.  ...  RG and TC are partially supported by the elite undergraduate training program of School of Mathematical Sciences in Peking University.  ... 
arXiv:1906.07916v2 fatcat:iem4qltxe5b4pcj5t7wik2uwpy
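The convergence result above concerns adversarial training of overparametrized networks. As a hedged illustration of the training procedure being analyzed (not the paper's algorithm or network class), the following sketch runs FGSM-style adversarial training of a plain logistic-regression model in numpy; the sizes, perturbation radius eps, and learning rate are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps, lr = 200, 10, 0.1, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)                          # labels in {-1, +1}

w = np.zeros(d)
for _ in range(500):
    # inner maximization: FGSM-style l_inf perturbation that increases the logistic loss
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))        # sigmoid(-y * <w, x>)
    X_adv = X + eps * np.sign(-(y * s)[:, None] * w[None, :])
    # outer minimization: gradient step on the loss evaluated at the perturbed inputs
    s_adv = 1.0 / (1.0 + np.exp(y * (X_adv @ w)))
    w += lr * (X_adv * (y * s_adv)[:, None]).mean(axis=0)

print("training error on clean inputs:", np.mean(np.sign(X @ w) != y))
```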

Effect of Activation Functions on the Training of Overparametrized Neural Nets [article]

Abhishek Panigrahi, Abhishek Shetty, Navin Goyal
2020 arXiv   pre-print
In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks.  ...  It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings.  ...  Theoretical analysis of training of highly overparametrized neural networks.  ... 
arXiv:1908.05660v4 fatcat:qa4dgxlyu5dvjfq2aeh46tbu3u

Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently [article]

Rong Ge, Runzhe Wang, Haoyu Zhao
2019 arXiv   pre-print
In this paper, we show that neural networks can be trained to memorize training data perfectly in a mildly overparametrized regime, where the number of parameters is just a constant factor more than the  ...  It has been observed that deep neural networks can memorize: they achieve 100% accuracy on training data.  ...  is linear in n, and simple optimization algorithms on such neural networks can fit any training data.  ... 
arXiv:1909.11837v1 fatcat:3vhwm24kcvec7mofry3f6ijboa
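A minimal numerical illustration of the memorization phenomenon described above: plain gradient descent on a small two-layer ReLU network (first layer trained, second layer fixed) fitting purely random ±1 labels. The width, step size, and iteration count are assumptions for the example, not the paper's construction or its parameter-counting bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, lr, steps = 15, 10, 500, 0.1, 2000
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)               # purely random labels

W = rng.normal(size=(m, d))                       # trained first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer

for _ in range(steps):
    act = X @ W.T                                 # n x m pre-activations
    resid = np.maximum(act, 0.0) @ a - y
    # gradient of 0.5 * sum of squared errors with respect to W
    W -= lr * (((resid[:, None] * (act > 0)) * a[None, :]).T @ X)

pred = np.sign(np.maximum(X @ W.T, 0.0) @ a)
print("training error on random labels:", np.mean(pred != y))   # typically 0 on this tiny problem
```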

Optimisation of Overparametrized Sum-Product Networks [article]

Martin Trapp and Robert Peharz and Franz Pernkopf
2019 arXiv   pre-print
In fact, gradient-based optimisation in deep tree-structured sum-product networks is equivalent to gradient ascent with adaptive and time-varying learning rates and additional momentum terms.  ...  This paper examines the effects of overparameterization in sum-product networks on the speed of parameter optimisation.  ...  Background and Related Work Overparameterization in Linear Networks Recent work has shown that increasing depth in linear neural networks can speed up the optimisation (Arora et al., 2018).  ... 
arXiv:1905.08196v2 fatcat:xxr4deajafavpdlkzjt2qaz5m4
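The background cited above (Arora et al., 2018) is the observation that overparameterizing a linear model can act like an adaptive learning-rate and momentum scheme. A minimal sketch of that effect in the scalar case, assuming a simple quadratic loss (this is the linear-network effect, not the sum-product-network result itself):

```python
import numpy as np

def loss_grad(w, target=3.0):
    return w - target                # gradient of 0.5 * (w - target)^2

lr, steps = 0.01, 200

w = 0.1                              # plain gradient descent on w
for _ in range(steps):
    w -= lr * loss_grad(w)

u = v = np.sqrt(0.1)                 # overparameterized form w = u * v, balanced init
for _ in range(steps):
    g = loss_grad(u * v)             # dL/dw evaluated at w = u * v
    u, v = u - lr * g * v, v - lr * g * u   # chain rule: dL/du = g * v, dL/dv = g * u

# the (u, v) run behaves like GD on w with an effective step size of about 2 * lr * |w|,
# so it accelerates as |w| grows
print("plain GD:", w, " overparameterized GD:", u * v)
```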

Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks [article]

Carles Domingo-Enrich, Alberto Bietti, Marylou Gabrié, Joan Bruna, Eric Vanden-Eijnden
2022 arXiv   pre-print
Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow overparametrized neural network energies, both in the feature-learning and lazy linearized  ...  In the feature-learning regime, this dual formulation justifies using a two time-scale gradient ascent-descent (GDA) training algorithm in which one updates concurrently the particles in the sample space  ...  D Training overparametrized two-layer neural networks via sampling In the previous section we described how the general duality result from App.  ... 
arXiv:2107.05134v2 fatcat:xcccovjaufgwxen5n6kffhk6q4
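The dual formulation above leads to a two time-scale gradient ascent-descent (GDA) loop. As a hedged sketch of that generic template, here is GDA on a toy saddle problem; the objective, step sizes, and variable names are assumptions for illustration, not the paper's EBM objective or particle dynamics.

```python
# toy saddle objective f(theta, x) = 0.5*theta^2 + theta*x - 0.5*x^2:
# minimize over theta, maximize over x
def grads(theta, x):
    g_theta = theta + x              # d f / d theta
    g_x = theta - x                  # d f / d x
    return g_theta, g_x

theta, x = 2.0, -1.0
lr_slow, lr_fast = 0.01, 0.1         # parameters move slowly, the dual variable moves fast
for _ in range(2000):
    g_theta, g_x = grads(theta, x)
    x += lr_fast * g_x               # ascent on the inner (dual) variable
    theta -= lr_slow * g_theta       # descent on the model parameters
print(theta, x)                      # both approach the saddle point (0, 0)
```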

Kernel and Rich Regimes in Overparametrized Models [article]

Blake Woodworth, Suriya Gunasekar, Pedro Savarese, Edward Moroshko, Itay Golan, Jason Lee, Daniel Soudry, Nathan Srebro
2020 arXiv   pre-print
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent  ...  This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.  ...  Introduction A string of recent papers study neural networks trained with gradient descent in the "kernel regime."  ... 
arXiv:1906.05827v3 fatcat:fgze2767q5aj7o2tly3u2kw5ni
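A small numerical sketch of the "kernel regime" idea referenced above: a wide two-layer ReLU network trained by gradient descent stays close to its first-order Taylor expansion in the parameters, i.e., it behaves like a linear (kernelized) predictor. The width, step size, and targets below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr, steps = 30, 10, 5000, 0.1, 300
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sin(X @ rng.normal(size=d))                 # arbitrary smooth targets

W0 = rng.normal(size=(m, d))                       # initialization, kept for the Taylor model
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed second layer
W = W0.copy()

def net(W):
    return np.maximum(X @ W.T, 0.0) @ a

for _ in range(steps):
    act = X @ W.T
    resid = np.maximum(act, 0.0) @ a - y
    # gradient of 0.5 * sum of squared errors with respect to W
    W -= lr * (((resid[:, None] * (act > 0)) * a[None, :]).T @ X)

# first-order Taylor (linearized) predictor around W0, evaluated at the trained W
f_lin = net(W0) + ((X @ (W - W0).T) * (X @ W0.T > 0)) @ a
print("max |trained net - linearized net| on the training set:",
      np.max(np.abs(net(W) - f_lin)))
```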

Kernel and Rich Regimes in Overparametrized Models [article]

Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro
2020 arXiv   pre-print
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent  ...  This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.  ...  Introduction A string of recent papers study neural networks trained with gradient descent in the "kernel regime."  ... 
arXiv:2002.09277v3 fatcat:3fm5ojsto5dztclosgmtnfskyu

Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena [article]

Matt Emschwiller, David Gamarnik, Eren C. Kızıldağ, Ilias Zadik
2020 arXiv   pre-print
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data, even though the number of parameters significantly  ...  exceeds the sample sizes, and the model perfectly fits the training data.  ...  [GK17] use approximation by polynomials, similar to our polynomial regression method, to learn an L = 1 hidden-layer neural network in polynomial time under a more general monotone non-linear output layer assumption  ... 
arXiv:2003.10523v1 fatcat:fxpd6dnysbavdcvldld4v6cnae
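As a hedged sketch of the polynomial-regression viewpoint above, the following fits a degree-2 polynomial model by least squares to data generated by a small ReLU teacher network; the teacher, the degree, and the sample sizes are assumptions for illustration, not the paper's construction or guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 500, 500, 5
X = rng.normal(size=(n_train + n_test, d))
W_teacher = rng.normal(size=(8, d)) / np.sqrt(d)
y = np.maximum(X @ W_teacher.T, 0.0).sum(axis=1)   # teacher: small one-hidden-layer ReLU net

def poly_features(X):
    # degree-2 polynomial features: 1, x_i, and x_i * x_j for i <= j
    iu = np.triu_indices(X.shape[1])
    quad = np.einsum('ni,nj->nij', X, X)[:, iu[0], iu[1]]
    return np.hstack([np.ones((len(X), 1)), X, quad])

Phi = poly_features(X)
coef, *_ = np.linalg.lstsq(Phi[:n_train], y[:n_train], rcond=None)
test_mse = np.mean((Phi[n_train:] @ coef - y[n_train:]) ** 2)
print("degree-2 polynomial regression, test MSE:", test_mse)
```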

Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [article]

Bartłomiej Polaczyk, Jacek Cyranka
2022 arXiv   pre-print
We study the overparametrization bounds required for the global convergence of stochastic gradient descent algorithm for a class of one hidden layer feed-forward neural networks, considering most of the  ...  activation functions used in practice, including ReLU.  ...  for Shallow Neural Networks".  ... 
arXiv:2201.12052v1 fatcat:g6vw2naanvfzbdgcd3vjc3yc5i

Tractability from overparametrization: The example of the negative perceptron [article]

Andrea Montanari, Yiqiao Zhong, Kangjie Zhou
2021 arXiv   pre-print
In other words, δ_s(κ) is the overparametrization threshold: for n/d≤δ_s(κ)-ε a classifier achieving vanishing training error exists with high probability, while for n/d≥δ_s(κ)+ε it does not.  ...  We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold δ_lin(κ).  ...  In contrast, modern neural networks often achieve vanishing training error even if the true labels are replaced by purely random ones [ZBH + 21, BMR21].  ... 
arXiv:2110.15824v2 fatcat:jpofhl4xqvgidlztgc54xpxfqq
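The entry above analyzes a linear programming algorithm for finding a classifier with vanishing training error. As a hedged sketch of the standard (kappa = 0) version of that idea, the following LP maximizes the margin of a linear separator under an l_inf weight bound using scipy (assumed available); it is not the paper's negative-margin procedure.

```python
import numpy as np
from scipy.optimize import linprog   # scipy assumed available

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))               # linearly separable labels

# variables (w_1, ..., w_d, t): maximize t subject to y_i * <x_i, w> >= t, |w_j| <= 1
c = np.zeros(d + 1)
c[-1] = -1.0                                      # linprog minimizes, so minimize -t
A_ub = np.hstack([-(y[:, None] * X), np.ones((n, 1))])   # encodes t - y_i * <x_i, w> <= 0
b_ub = np.zeros(n)
bounds = [(-1.0, 1.0)] * d + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w, t = res.x[:d], res.x[-1]
print("achieved margin:", t, " training error:", np.mean(np.sign(X @ w) != y))
```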

The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime [article]

Andrea Montanari and Feng Ruan and Youngtak Sohn and Jun Yan
2020 arXiv   pre-print
In particular we consider a specific structure of (θ_*,Σ) that captures the behavior of nonlinear random feature models or, equivalently, two-layers neural networks with random first layer weights.  ...  Max-margin linear classifiers are among the simplest classification methods that have zero training error (with linearly separable data).  ...  First of all, most applications of neural networks are to classification rather than regression.  ... 
arXiv:1911.01544v2 fatcat:lpsr6zy7mjge3d4tfnb5d3vhcq
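The model class above is the random-features model, i.e., a two-layer network whose first-layer weights are random and fixed. A minimal sketch of that model in the overparametrized regime, fitting the second layer by minimum-norm least-squares interpolation rather than the max-margin classifier analyzed in the paper (an assumption made to keep the example dependency-free):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 100, 20, 400                        # overparametrized: p features >> n samples
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)

W = rng.normal(size=(p, d))                   # random first-layer weights, never trained
F = np.maximum(X @ W.T, 0.0)                  # random ReLU features, shape n x p
theta, *_ = np.linalg.lstsq(F, y, rcond=None) # minimum-norm interpolating second layer
print("training error:", np.mean(np.sign(F @ theta) != y))   # typically 0 when p >> n
```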

Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization [article]

Liwei Jiang, Yudong Chen, Lijun Ding
2022 arXiv   pre-print
In particular, our complexity bound is almost dimension-free and depends logarithmically on the final error, and our results have lenient requirements on the stepsize and initialization.  ...  networks [Du et al., 2018, Ye and Du, 2021].  ...  As such, understanding the dynamics of (GD-M) provides deep intuition for these more general problems and is often regarded as an important first step for understanding various aspects of (linear) neural  ... 
arXiv:2203.02839v1 fatcat:4jd475aizrbdneutduavv4pqnu
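A minimal sketch of the setting studied above: gradient descent on an overparametrized asymmetric factorization X Y^T of a low-rank matrix, with small random initialization and no explicit regularization. The step size, ranks, and normalization of the target are assumptions for illustration; on this toy instance the relative error typically becomes very small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r_fit, lr, steps = 30, 2, 10, 0.2, 2000
M = rng.normal(size=(n, r_true)) @ rng.normal(size=(r_true, n))
M /= np.linalg.norm(M, 2)                     # normalize the spectral norm to 1

X = 1e-3 * rng.normal(size=(n, r_fit))        # overparametrized factors (r_fit > true rank)
Y = 1e-3 * rng.normal(size=(n, r_fit))        # small random init, no explicit regularizer
for _ in range(steps):
    R = X @ Y.T - M                           # residual
    X, Y = X - lr * (R @ Y), Y - lr * (R.T @ X)   # gradients of 0.5 * ||X Y^T - M||_F^2
print("relative error:", np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M))
```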

Strength of Minibatch Noise in SGD [article]

Liu Ziyin, Kangqiao Liu, Takashi Mori, Masahito Ueda
2022 arXiv   pre-print
For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization  ...  We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima.  ...  width to stabilize neural network training.  ... 
arXiv:2102.05375v3 fatcat:kmxmaqi6rncjnono5xadthvy4q
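As a hedged illustration of the linear-regression setting analyzed above, the following estimates the strength of minibatch SGD noise empirically: it compares minibatch gradients with the full-batch gradient at a fixed parameter value and reports the trace of the empirical noise covariance. All sizes and the evaluation point are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch, n_draws = 1000, 10, 32, 2000
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.5 * rng.normal(size=n)     # noisy linear targets

w = np.zeros(d)                               # point at which the SGD noise is measured

def grad(idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)    # gradient of 0.5 * mean squared error

full = grad(np.arange(n))
draws = np.array([grad(rng.choice(n, size=batch, replace=False)) for _ in range(n_draws)])
noise = draws - full                          # minibatch gradient minus full-batch gradient
print("noise strength (trace of the empirical covariance):", np.trace(np.cov(noise.T)))
```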

Predicting Training Time Without Training [article]

Luca Zancato, Alessandro Achille, Avinash Ravichandran, Rahul Bhotika, Stefano Soatto
2020 arXiv   pre-print
To do so, we leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.  ...  In our experiments, we are able to predict training time of a ResNet within a 20% error margin on a variety of datasets and hyper-parameters, at a 30 to 45-fold reduction in cost compared to actual training  ...  We look to efficiently estimate the number of training steps a Deep Neural Network (DNN) needs to converge to a given value of the loss function, without actually having to train the network.  ... 
arXiv:2008.12478v1 fatcat:hzdu6sm4hjelffjkk3vwyl73ru
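The entry above predicts training time from linearized training dynamics. A minimal sketch of the underlying idea, with ordinary linear regression standing in for the linearized deep network: the full-batch gradient-descent loss has a closed form in the eigenbasis of the Gram matrix, so the number of steps needed to reach a target loss can be predicted without running the training loop. All sizes and thresholds are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, target_loss = 100, 200, 0.1, 1e-3
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

K = X @ X.T                                   # Gram matrix of the linear(ized) model
lam, U = np.linalg.eigh(K)
c = U.T @ y                                   # initial residual in the eigenbasis (w starts at 0)

def loss_after(t):
    # loss 0.5 * ||X w_t - y||^2 after t full-batch GD steps, in closed form
    return 0.5 * np.sum((1.0 - lr * lam) ** (2 * t) * c**2)

predicted = next(t for t in range(1, 100_000) if loss_after(t) < target_loss)
print("predicted number of GD steps to reach the target loss:", predicted)
```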
Showing results 1 — 15 out of 332 results