Training (Overparametrized) Neural Networks in Near-Linear Time
[article]
2020
arXiv
pre-print
We show how to speed up the algorithm of [CGH+19], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of ...
Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an O(mn^2)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial ...
for linear training time. ...
arXiv:2006.11648v2
fatcat:b7dmwuurivetnafmbznqtghnum
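The entry above is about the per-step cost of training, so as a point of reference, here is a minimal sketch of plain backpropagation for a two-layer ReLU network on a batch of data, showing the size of the objects one gradient step touches. The sizes (m hidden units, n samples, d input features), the fixed second layer, and the squared loss are illustrative assumptions; this is not the paper's accelerated second-order algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 128, 16, 1024                        # samples, input dim, hidden width (m large: overparametrized)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = rng.standard_normal((m, d)) / np.sqrt(d)   # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=m)            # fixed second-layer signs, common in two-layer analyses

H = np.maximum(X @ W.T, 0.0)                   # ReLU activations, shape (n, m)
pred = H @ a / np.sqrt(m)                      # network outputs, shape (n,)
residual = pred - y

# One backpropagation pass: the n x m activation/pattern matrix above and the
# m x d parameter gradient below are the large objects a training step touches.
G = ((residual[:, None] * (H > 0.0)) * a[None, :]).T @ X / np.sqrt(m)
print(G.shape)                                 # (m, d): gradient of the squared loss w.r.t. W
```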
Convergence of Adversarial Training in Overparametrized Neural Networks
[article]
2019
arXiv
pre-print
In addition, we also prove that robust interpolation requires more model capacity, supporting the evidence that adversarial training requires wider networks. ...
Neural networks are vulnerable to adversarial examples, i.e. inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network. ...
RG and TC are partially supported by the elite undergraduate training program of School of Mathematical Sciences in Peking University. ...
arXiv:1906.07916v2
fatcat:iem4qltxe5b4pcj5t7wik2uwpy
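The snippet above defines adversarial examples as imperceptibly perturbed inputs that are misclassified. A minimal sketch of that notion using a gradient-sign (FGSM-style) step against a toy linear classifier; the model, the L_inf budget, and the construction are illustrative assumptions, not the paper's adversarial-training procedure.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 500
w = rng.standard_normal(d)                     # a fixed linear classifier standing in for a network
x = rng.standard_normal(d)
y = 1.0 if w @ x > 0 else -1.0                 # label chosen so the clean point is classified correctly

eps = 0.1                                      # small per-coordinate (L_inf) budget
grad_x = -y * w                                # direction that increases the logistic loss w.r.t. x
x_adv = x + eps * np.sign(grad_x)              # gradient-sign perturbation

# The margin drops by eps * ||w||_1; in high dimension this typically exceeds the
# clean margin, so the perturbed point is usually misclassified even though no
# coordinate of x moved by more than eps.
print("clean margin:    ", y * (w @ x))
print("perturbed margin:", y * (w @ x_adv))
```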
Effect of Activation Functions on the Training of Overparametrized Neural Nets
[article]
2020
arXiv
pre-print
In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. ...
It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. ...
Theoretical analysis of training of highly overparametrized neural networks. ...
arXiv:1908.05660v4
fatcat:qa4dgxlyu5dvjfq2aeh46tbu3u
Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently
[article]
2019
arXiv
pre-print
In this paper, we show that neural networks can be trained to memorize training data perfectly in a mildly overparametrized regime, where the number of parameters is just a constant factor more than the ...
It has been observed that deep neural networks can memorize: they achieve 100% accuracy on training data. ...
is linear in n, and simple optimization algorithms on such neural networks can fit any training data. ...
arXiv:1909.11837v1
fatcat:3vhwm24kcvec7mofry3f6ijboa
Optimisation of Overparametrized Sum-Product Networks
[article]
2019
arXiv
pre-print
In fact, gradient-based optimisation in deep tree-structured sum-product networks is equivalent to gradient ascent with adaptive and time-varying learning rates and additional momentum terms. ...
This paper examines the effects of overparameterization in sum-product networks on the speed of parameter optimisation. ...
Background and Related Work
Overparameterization in Linear Networks
Recent work has shown that increasing depth in linear neural networks can speed up the optimisation (Arora et al., 2018). ...
arXiv:1905.08196v2
fatcat:xxr4deajafavpdlkzjt2qaz5m4
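The related-work snippet above cites the observation that adding depth to a linear network can speed up optimisation (Arora et al., 2018). A minimal sketch of that reparametrization on plain linear regression, with illustrative sizes and stepsize; it is not the sum-product-network algorithm studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, d_out, n = 8, 32, 4, 64
X = rng.standard_normal((n, d_in))
Y = rng.standard_normal((n, d_out))

W1 = 0.1 * rng.standard_normal((d_hidden, d_in))   # factors of the overparametrized linear map
W2 = 0.1 * rng.standard_normal((d_out, d_hidden))
lr = 1e-2

for _ in range(200):
    W = W2 @ W1                      # end-to-end map, shape (d_out, d_in)
    R = X @ W.T - Y                  # residuals
    grad_W = R.T @ X / n             # gradient w.r.t. the end-to-end map
    # By the chain rule the factors see transformed versions of grad_W; this
    # coupling is what acts like an adaptive, history-dependent preconditioner.
    g2, g1 = grad_W @ W1.T, W2.T @ grad_W
    W2 -= lr * g2
    W1 -= lr * g1
print("residual norm:", np.linalg.norm(X @ (W2 @ W1).T - Y))
```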
Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks
[article]
2022
arXiv
pre-print
Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow overparametrized neural network energies, both in the feature-learning and lazy linearized ...
In the feature-learning regime, this dual formulation justifies using a two time-scale gradient ascent-descent (GDA) training algorithm in which one updates concurrently the particles in the sample space ...
D Training overparametrized two-layer neural networks via sampling
In the previous section we described how the general duality result from App. ...
arXiv:2107.05134v2
fatcat:xcccovjaufgwxen5n6kffhk6q4
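The snippet above mentions a two time-scale gradient ascent-descent (GDA) training loop. A minimal sketch of that update pattern on a toy saddle-point problem; the objective and stepsizes are assumptions, and this is not the paper's EBM/particle algorithm.

```python
import numpy as np

def f(x, y):                       # toy saddle objective: min over x, max over y
    return x * y + 0.1 * x**2 - 0.1 * y**2

x, y = 1.0, -1.0
eta_x, eta_y = 1e-2, 1e-1          # two time scales: the ascent player moves faster

for _ in range(2000):
    gx = y + 0.2 * x               # df/dx
    gy = x - 0.2 * y               # df/dy
    x -= eta_x * gx                # descent step on x
    y += eta_y * gy                # concurrent ascent step on y
print(x, y)                        # approaches the saddle point (0, 0)
```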
Kernel and Rich Regimes in Overparametrized Models
[article]
2020
arXiv
pre-print
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent ...
This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. ...
Introduction
A string of recent papers study neural networks trained with gradient descent in the "kernel regime." ...
arXiv:1906.05827v3
fatcat:fgze2767q5aj7o2tly3u2kw5ni
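The abstract above describes the "kernel regime," in which the network behaves during training like a kernelized linear predictor. A minimal sketch of the underlying approximation: a wide two-layer ReLU network is compared with its first-order Taylor expansion in the weights around initialization. The tiny network and perturbation scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 2000                                    # input dim, width (large width ~ kernel regime)
x = rng.standard_normal(d)

W0 = rng.standard_normal((m, d)) / np.sqrt(d)     # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer

def f(W):
    return a @ np.maximum(W @ x, 0.0)             # two-layer ReLU network output (scalar)

def grad_f(W):
    # d f / d W_ij = a_i * 1[w_i . x > 0] * x_j
    return ((W @ x > 0) * a)[:, None] * x[None, :]

delta = 1e-3 * rng.standard_normal((m, d))        # a small weight perturbation
print("actual change:    ", f(W0 + delta) - f(W0))
print("linearized change:", np.sum(grad_f(W0) * delta))  # nearly identical for wide nets and small moves
```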
Kernel and Rich Regimes in Overparametrized Models
[article]
2020
arXiv
pre-print
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent ...
This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. ...
Introduction
A string of recent papers study neural networks trained with gradient descent in the "kernel regime." ...
arXiv:2002.09277v3
fatcat:3fm5ojsto5dztclosgmtnfskyu
Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena
[article]
2020
arXiv
pre-print
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data, even though the number of parameters significantly exceeds the sample size and the model perfectly fits the training data. ...
[GK17] use approximation by polynomials, similar to our polynomial regression method, to learn an L = 1 hidden-layer neural network in polynomial time under a more general monotone non-linear output layer assumption ...
arXiv:2003.10523v1
fatcat:fxpd6dnysbavdcvldld4v6cnae
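The snippet above refers to learning a one-hidden-layer network via polynomial regression. A minimal sketch of that idea: data generated by a small one-hidden-layer teacher with a smooth activation is fit by ridge regression on low-degree polynomial features. The degree, widths, and activation are illustrative assumptions, not the paper's construction.

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(4)
n, d, m = 2000, 4, 20
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m) / np.sqrt(m)
y = np.tanh(X @ W.T) @ a                       # teacher: one-hidden-layer network

def poly_features(X, degree=3):
    # All monomials of the input coordinates up to the given degree.
    cols = [np.ones(len(X))]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), k):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)

Phi = poly_features(X)
theta = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(Phi.shape[1]), Phi.T @ y)
print("target variance:", np.var(y))
print("train MSE:      ", np.mean((Phi @ theta - y) ** 2))
```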
Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks
[article]
2022
arXiv
pre-print
We study the overparametrization bounds required for the global convergence of stochastic gradient descent algorithm for a class of one hidden layer feed-forward neural networks, considering most of the ...
activation functions used in practice, including ReLU. ...
for Shallow Neural Networks". ...
arXiv:2201.12052v1
fatcat:g6vw2naanvfzbdgcd3vjc3yc5i
Tractability from overparametrization: The example of the negative perceptron
[article]
2021
arXiv
pre-print
In other words, δ_s(κ) is the overparametrization threshold: for n/d ≤ δ_s(κ) - ε a classifier achieving vanishing training error exists with high probability, while for n/d ≥ δ_s(κ) + ε it does not. ...
We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold δ_lin(κ). ...
In contrast, modern neural networks often achieve vanishing training error even if the true labels are replaced by purely random ones [ZBH + 21, BMR21]. ...
arXiv:2110.15824v2
fatcat:jpofhl4xqvgidlztgc54xpxfqq
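The abstract above analyzes a linear programming algorithm for finding a classifier with vanishing training error. A minimal sketch of checking, via a linear program, whether a linear classifier with zero training error exists for given data; the random data and the particular feasibility formulation are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, d = 50, 200                      # overparametrized direction: d > n
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Feasibility LP: find theta with y_i <x_i, theta> >= 1 for all i,
# written as A_ub @ theta <= b_ub for linprog.
A_ub = -(y[:, None] * X)
b_ub = -np.ones(n)
res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * d, method="highs")
print("zero training error achievable by a linear classifier:", res.success)
```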
The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime
[article]
2020
arXiv
pre-print
In particular we consider a specific structure of (θ_*,Σ) that captures the behavior of nonlinear random feature models or, equivalently, two-layers neural networks with random first layer weights. ...
Max-margin linear classifiers are among the simplest classification methods that have zero training error (with linearly separable data). ...
First of all, most applications of neural networks are to classification rather than regression. ...
arXiv:1911.01544v2
fatcat:lpsr6zy7mjge3d4tfnb5d3vhcq
Algorithmic Regularization in Model-free Overparametrized Asymmetric Matrix Factorization
[article]
2022
arXiv
pre-print
In particular, our complexity bound is almost dimension-free and depends logarithmically on the final error, and our results have lenient requirements on the stepsize and initialization. ...
networks [Du et al., 2018, Ye and Du, 2021]. ...
As such, understanding the dynamics of (GD-M) provides deep intuition for these more general problems and is often regarded as an important first step for understanding various aspects of (linear) neural ...
arXiv:2203.02839v1
fatcat:4jd475aizrbdneutduavv4pqnu
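The entry above studies gradient descent with momentum (GD-M) on overparametrized asymmetric matrix factorization from small unstructured initialization. A minimal sketch of that iteration; dimensions, stepsize, and momentum value are illustrative assumptions, not the analyzed regime.

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, r, k = 30, 20, 3, 10                   # true rank r, factor width k > r (overparametrized)
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2)) / np.sqrt(n1)

U = 1e-3 * rng.standard_normal((n1, k))        # small, unstructured ("model-free") initialization
V = 1e-3 * rng.standard_normal((n2, k))
mU, mV = np.zeros_like(U), np.zeros_like(V)
lr, beta = 1e-2, 0.9                           # stepsize and heavy-ball momentum

for _ in range(3000):
    R = U @ V.T - M                            # residual
    gU, gV = R @ V, R.T @ U                    # gradients of 0.5 * ||U V^T - M||_F^2
    mU, mV = beta * mU + gU, beta * mV + gV    # momentum buffers
    U -= lr * mU
    V -= lr * mV
print("relative recovery error:", np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))
```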
Strength of Minibatch Noise in SGD
[article]
2022
arXiv
pre-print
For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization ...
We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. ...
width to stabilize neural network training. ...
arXiv:2102.05375v3
fatcat:kmxmaqi6rncjnono5xadthvy4q
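The snippet above analyzes SGD noise in linear regression. A minimal sketch of what that noise is: the deviation of a minibatch gradient from the full-batch gradient, measured empirically; batch size, model, and data are illustrative choices, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, batch = 1000, 10, 32
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.5 * rng.standard_normal(n)

theta = np.zeros(d)                                   # point at which the noise is measured
full_grad = X.T @ (X @ theta - y) / n                 # full-batch gradient of the (halved) mean squared loss

# SGD noise = minibatch gradient minus full-batch gradient; its covariance is the
# kind of object the entry above approximates analytically.
draws = []
for _ in range(500):
    idx = rng.choice(n, size=batch, replace=False)
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / batch  # minibatch gradient
    draws.append(g - full_grad)
noise = np.array(draws)
print("mean squared noise norm:", np.mean(np.sum(noise**2, axis=1)))
# Roughly proportional to 1/batch when batch << n.
```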
Predicting Training Time Without Training
[article]
2020
arXiv
pre-print
To do so, we leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. ...
In our experiments, we are able to predict training time of a ResNet within a 20% error margin on a variety of datasets and hyper-parameters, at a 30 to 45-fold reduction in cost compared to actual training ...
We look to efficiently estimate the number of training steps a Deep Neural Network (DNN) needs to converge to a given value of the loss function, without actually having to train the network. ...
arXiv:2008.12478v1
fatcat:hzdu6sm4hjelffjkk3vwyl73ru
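The entry above predicts training time from a linearized model of the network. A minimal sketch of why that is cheap: for a linearized (kernel) model trained by gradient descent on the squared loss, the whole loss curve has a closed form in the kernel's eigendecomposition, so steps-to-target can be read off without training. The random stand-in kernel below is an illustrative assumption, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
A = rng.standard_normal((n, n)) / np.sqrt(n)
K = A @ A.T + 1e-3 * np.eye(n)                 # stand-in kernel of the linearized model
y = rng.standard_normal(n)                     # training targets (residuals at initialization)

lam, Q = np.linalg.eigh(K)                     # kernel eigendecomposition
c = Q.T @ y                                    # targets in the kernel eigenbasis
lr = 0.5 / lam.max()                           # stable gradient-descent stepsize

def predicted_loss(t):
    # Under GD on the squared loss, the residual along eigendirection i of the
    # linearized model shrinks as (1 - lr * lam_i)^t.
    return np.mean(((1.0 - lr * lam) ** t * c) ** 2)

target = 0.01 * predicted_loss(0)
t = 0
while predicted_loss(t) > target:              # closed-form curve: no training needed
    t += 1
print("predicted steps to reach 1% of the initial loss:", t)
```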
Showing results 1 — 15 out of 332 results