SGD with Variance Reduction beyond Empirical Risk Minimization
[article]
2016
arXiv
pre-print
The proposed algorithm is doubly stochastic in the sense that gradient steps are done using stochastic gradient descent (SGD) with variance reduction, where the inner expectations are approximated by a ...
Conclusion: We have proposed a doubly stochastic gradient algorithm to extend SGD-like algorithms beyond the empirical risk minimization setting. ...
This makes this setting quite different from the usual case of empirical risk minimization (linear regression, logistic regression, etc.), where all the gradients ∇f_i share the same low numerical cost ...
arXiv:1510.04822v3
fatcat:bgktozbtjjfxbezoqkfcyq52zu
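As a rough illustration of the doubly stochastic idea described in this snippet, here is a minimal Python/numpy sketch, assuming an SVRG-style outer loop in which each per-example gradient is itself an expectation approximated by a small inner Monte Carlo sample. The toy least-squares objective, noise scale, and step size are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: least squares where each per-example gradient is itself an
# expectation over a noise variable z (approximated by Monte Carlo below).
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

def grad_i(theta, i, z):
    """Monte Carlo estimate of ∇f_i(θ) = E_z[(⟨θ, x_i⟩ - y_i + z) x_i]."""
    residual = X[i] @ theta - y[i] + z          # z: array of inner samples
    return residual.mean() * X[i]

theta = np.zeros(d)
step, inner_mc, epochs, inner_iters = 0.05, 8, 20, n
for _ in range(epochs):
    theta_ref = theta.copy()
    # Reference (full) gradient, each term approximated with inner MC samples.
    mu = np.mean([grad_i(theta_ref, i, rng.normal(scale=0.1, size=inner_mc))
                  for i in range(n)], axis=0)
    for _ in range(inner_iters):
        i = rng.integers(n)
        z = rng.normal(scale=0.1, size=inner_mc)  # inner samples shared by both terms
        v = grad_i(theta, i, z) - grad_i(theta_ref, i, z) + mu
        theta -= step * v

print("distance to truth:", np.linalg.norm(theta - theta_true))
```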
Hedging foreign exchange rate risk: Multi-currency diversification
2016
European Journal of Management and Business Economics
The most widely used optimal hedge ratio is the so-called minimum-variance (MV) hedge ratio. This is a single objective problem where the risk, measured with the variance, is minimized. ...
In the long scenario, the total reduction in VaR and CVaR ranges from 1% for SGD (Singapore dollar) to 39.3% for AUD and from 3% for SGD to 38.4% for AUD, with average reductions of 26.4% and 23.9%, respectively. ...
#h > 0: total number of currencies in the hedging portfolio with a long position; #h < 0: total number of currencies in the hedging portfolio with a short position; Total: risk reduction for minimum VaR ...
doi:10.1016/j.redee.2015.11.003
fatcat:qanefbwbcbbh3ktchpuvx5tmci
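The minimum-variance hedge ratio mentioned above is the regression coefficient h* = Cov(spot, hedge) / Var(hedge). A minimal numpy sketch on synthetic return series follows; the data and the resulting numbers are made up, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic spot and hedging-instrument returns with some co-movement.
n = 500
futures = rng.normal(scale=0.01, size=n)
spot = 0.8 * futures + rng.normal(scale=0.005, size=n)

# Minimum-variance hedge ratio: h* = Cov(spot, futures) / Var(futures).
h_star = np.cov(spot, futures)[0, 1] / np.var(futures, ddof=1)

hedged = spot - h_star * futures
reduction = 1.0 - np.var(hedged, ddof=1) / np.var(spot, ddof=1)
print(f"h* = {h_star:.3f}, variance reduction = {reduction:.1%}")
```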
Reducing Runtime by Recycling Samples
[article]
2016
arXiv
pre-print
Contrary to the situation with stochastic gradient descent, we argue that when using stochastic methods with variance reduction, such as SDCA, SAG or SVRG, as well as their variants, it could be beneficial ...
We demonstrate this empirically for SDCA, SAG and SVRG, studying the optimal sample size one should use, and also uncover behavior that suggests running SDCA for an integer number of epochs could be wasteful ...
Preliminaries (SVM-type objectives and stochastic optimization): Consider SVM-type training, where we learn a linear predictor by regularized empirical risk minimization with a convex loss (hinge loss for ...
arXiv:1602.02136v1
fatcat:ljvahxcxbfdbfbkzpc7mb2x76q
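As a concrete reference point for the variance-reduced methods named in this snippet (SDCA, SAG, SVRG), here is a minimal SAG sketch on a smooth L2-regularized logistic objective; the data, step size, and iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary classification; L2-regularized logistic loss (a smooth
# stand-in for the SVM-type objectives discussed in the snippet).
n, d, lam = 300, 10, 1e-2
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

def grad_i(w, i):
    margin = y[i] * (X[i] @ w)
    return -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w

# SAG: keep a table of the most recent gradient of every example and step
# with the running average of that table.
w = np.zeros(d)
table = np.zeros((n, d))
avg = table.mean(axis=0)
step = 0.05
for t in range(20 * n):
    i = rng.integers(n)
    g_new = grad_i(w, i)
    avg += (g_new - table[i]) / n
    table[i] = g_new
    w -= step * avg

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```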
Accelerating Stochastic Gradient Descent Using Antithetic Sampling
[article]
2018
arXiv
pre-print
However, the rather high variance introduced by the stochastic gradient at each step may slow down convergence. ...
In this paper, we propose the antithetic sampling strategy to reduce the variance by taking advantage of the internal structure of the dataset. ...
And in recent years, SGD has been widely used to minimize the empirical risk in the machine learning community [Shalev-Shwartz et al., 2007; Shamir and Zhang, 2013; Bottou et al., 2016]. ...
arXiv:1810.03124v1
fatcat:fwneo6dz6vfcra6uxs5y7rdvsy
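The paper builds antithetic pairs from the internal structure of the dataset; as a generic illustration of why negatively correlated samples reduce estimator variance, here is a minimal antithetic-variates sketch for a plain Monte Carlo expectation. The test function and distribution are arbitrary choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Antithetic variates: estimate E[g(Z)] for Z ~ N(0, 1) by averaging g over
# antithetic pairs (z, -z), which are negatively correlated when g is
# monotone, so their average has lower variance than independent samples.
def g(z):
    return np.exp(0.5 * z)          # any monotone test function

m = 1_000
z = rng.normal(size=m)
plain = g(rng.normal(size=2 * m)).mean()        # 2m independent samples
antithetic = ((g(z) + g(-z)) / 2.0).mean()      # m antithetic pairs

print("plain estimate:", plain)
print("antithetic estimate:", antithetic)
# Repeating both estimators many times and comparing their empirical variances
# shows the antithetic version is tighter at the same sampling budget.
```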
Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics
[article]
2016
arXiv
pre-print
data with low variance that take the form of averages over k-tuples. ...
to as incomplete U-statistics, without damaging the O_P(1/√(n)) learning rate of Empirical Risk Minimization (ERM) procedures. ...
Bellet was affiliated with Télécom ParisTech. ...
arXiv:1501.02629v4
fatcat:22kpgfipvjdfpcxx6mzuycqfdm
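A minimal sketch of the complete versus incomplete U-statistic idea on a degree-2 (pairwise ranking) risk: the incomplete version averages over a random sample of pairs instead of all of them. The scorer, data, and number of sampled pairs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Pairwise (degree-2 U-statistic) risk: fraction of misranked
# positive/negative pairs for a linear scorer.
n, d = 400, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(int)
w = rng.normal(size=d)
scores = X @ w

pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

# Complete U-statistic: average over all positive/negative pairs.
complete = np.mean(scores[pos][:, None] <= scores[neg][None, :])

# Incomplete U-statistic: average over B pairs sampled with replacement.
B = 500
i = rng.choice(pos, size=B)
j = rng.choice(neg, size=B)
incomplete = np.mean(scores[i] <= scores[j])

print("complete:", complete, "incomplete:", incomplete)
```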
Scaling-up Distributed Processing of Data Streams for Machine Learning
[article]
2020
arXiv
pre-print
For such methods, the paper discusses recent advances in terms of distributed algorithmic designs when faced with high-rate streaming data. ...
In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that ...
structural risk minimization (SRM) and empirical risk minimization (ERM). ...
arXiv:2005.08854v2
fatcat:y6fvajvq2naajeqs6lo3trrgwy
On the Benefits of Invariance in Neural Networks
[article]
2020
arXiv
pre-print
We prove that training with data augmentation leads to better estimates of risk and gradients thereof, and we provide a PAC-Bayes generalization bound for models trained with data augmentation. ...
We provide empirical support of these theoretical results, including a demonstration of why generalization may not improve by training with data augmentation: the 'learned invariance' fails outside of ...
Combined with (21), the reduction in empirical augmented risk follows. The reduction in R•(Q, D_n) follows trivially. ...
arXiv:2005.00178v1
fatcat:45lmcynbjnertgapp6x2ok2yu4
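A minimal sketch of the augmented empirical risk referred to in the last snippet line: the loss is averaged over random, label-preserving transformations of the inputs (here planar rotations, assuming rotation invariance of the labels). The model and loss are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Augmented empirical risk: average the loss over random transformations of
# each input, assuming the transformations preserve the label.
n = 200
X = rng.normal(size=(n, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(float)   # rotation-invariant labels
w = np.array([0.7, -0.2])                              # some fixed linear model

def loss(Xb, yb):
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))

def rotate(Xb, angle):
    c, s = np.cos(angle), np.sin(angle)
    return Xb @ np.array([[c, -s], [s, c]])

k = 8                                                  # augmentations per pass
aug_risk = np.mean([loss(rotate(X, a), y)
                    for a in rng.uniform(0, 2 * np.pi, size=k)])
print("empirical risk:", loss(X, y), "augmented risk:", aug_risk)
```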
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
[article]
2018
arXiv
pre-print
Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction. ...
(b) SGD iteration with mini-batch size m > m* is nearly equivalent to a full gradient descent iteration (saturation regime). ...
The table on the right compares the convergence of SGD in the interpolation setting with several popular variance reduction methods. ...
arXiv:1712.06559v3
fatcat:cgm2ieqksfa3bcza3zb3fgn52e
Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD
[article]
2021
arXiv
pre-print
When the number N of data items is large, SGD relies on constructing an unbiased estimator of the gradient of the empirical risk using a small subset of the original dataset, called a minibatch. ...
Default minibatch construction involves uniformly sampling a subset of the desired size, but alternatives have been explored for variance reduction. ...
This indicates that there is variance reduction beyond the change of the rate. ...
arXiv:2112.06007v1
fatcat:7zr3zigelzgzla6pzeqtkwjknu
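A minimal sketch of the unbiased minibatch gradient estimator described in the snippet, together with a non-uniformly sampled variant reweighted by 1/(N p_i) that stays unbiased. The paper's DPP-based minibatches are a further alternative and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)

# Empirical risk (1/N) * sum_i f_i(theta) for least squares; two minibatch
# gradient estimators that are unbiased for the full gradient.
N, d, m = 1000, 4, 32
X = rng.normal(size=(N, d))
theta = rng.normal(size=d)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def grad_i(i):
    return (X[i] @ theta - y[i]) * X[i]

full = np.mean([grad_i(i) for i in range(N)], axis=0)

# (a) Uniform minibatch: plain average over m uniformly drawn indices.
idx = rng.choice(N, size=m, replace=False)
uniform_est = np.mean([grad_i(i) for i in idx], axis=0)

# (b) Non-uniform sampling with importance weights 1/(N p_i) keeps the
#     estimator unbiased while allowing variance-reducing sampling designs.
p = np.abs(X @ theta - y) + 1e-8
p /= p.sum()
idx = rng.choice(N, size=m, p=p)
weighted_est = np.mean([grad_i(i) / (N * p[i]) for i in idx], axis=0)

print(np.linalg.norm(uniform_est - full), np.linalg.norm(weighted_est - full))
```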
Early Stopping without a Validation Set
[article]
2017
arXiv
pre-print
It merely ensures that we do not minimize the empirical risk L_D of a given model beyond the point of best generalization. ...
Often there is easy access to the gradient of L_D, and gradient-based optimizers can be used to minimize the empirical risk. ...
Supplements, Section 5 (Comparison to RMSPROP): This section explores the differences and similarities of SGD+EB-criterion and RMSPROP. ...
arXiv:1703.09580v3
fatcat:vpfieu2lcrdkla3s6vybmkgdve
Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification
[article]
2018
arXiv
pre-print
for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate. ...
These results are then utilized in providing a highly parallelizable SGD method that obtains the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly ...
the standard Empirical Risk Minimizer (or Maximum Likelihood Estimator) (Lehmann and Casella, 1998; van der Vaart, 2000). ...
arXiv:1610.03774v4
fatcat:7gzhgqawanbpndi4ztzipfktni
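A minimal sketch of tail-averaging for least-squares SGD: run T steps and return the average of the final iterates rather than the last one. The averaging window (second half), step size, and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# SGD for least squares with tail-averaging: averaging the last T/2 iterates
# damps the variance of the final iterate.
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

T, step = 20_000, 0.01
theta = np.zeros(d)
tail_sum, tail_count = np.zeros(d), 0
for t in range(T):
    i = rng.integers(n)
    theta -= step * (X[i] @ theta - y[i]) * X[i]
    if t >= T // 2:                 # tail-averaging window
        tail_sum += theta
        tail_count += 1

theta_tail = tail_sum / tail_count
print("last iterate error:", np.linalg.norm(theta - theta_true))
print("tail-averaged error:", np.linalg.norm(theta_tail - theta_true))
```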
An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning
[article]
2021
arXiv
pre-print
The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. ...
Cotter et al. (2011), which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin (2018), which is limited to least squares problems and is also similarly suboptimal with ...
Acknowledgements: We thank Ohad Shamir for several helpful discussions in the process of preparing this article, and also George Lan for a conversation about optimization with bounded σ*. ...
arXiv:2106.02720v2
fatcat:23jfzoqpdrcmxfqttwtunx5bi4
Training Efficiency and Robustness in Deep Learning
[article]
2021
arXiv
pre-print
We formalize a simple trick called hard negative mining as a modification to the learning objective function with no computational overhead. ...
Finally, we study adversarial robustness in deep learning and approaches to achieve maximal adversarial robustness without training with additional data. ...
i.e., empirical risk minimization on adversarial samples. ...
arXiv:2112.01423v1
fatcat:3yqco7htnjdbng4hx2ilkrnkaq
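A minimal sketch of hard negative mining as a modification of the learning objective: in each minibatch, only the k highest-loss examples contribute to the gradient. The model, loss, and value of k are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hard negative mining as an objective modification: back-propagate only
# through the k examples with the largest loss in each minibatch.
n, d, k, batch = 500, 10, 8, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)
w = np.zeros(d)

def per_example_loss(Xb, yb):
    return np.log1p(np.exp(-yb * (Xb @ w)))            # logistic loss

for t in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    losses = per_example_loss(X[idx], y[idx])
    hard = idx[np.argsort(losses)[-k:]]                 # k hardest examples
    margin = y[hard] * (X[hard] @ w)
    grad = np.mean((-y[hard] / (1 + np.exp(margin)))[:, None] * X[hard], axis=0)
    w -= 0.1 * grad

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```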
An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise
[article]
2020
arXiv
pre-print
Our empirical studies with standard deep learning model architectures and datasets show that our method not only improves generalization performance in large-batch training, but furthermore, does so in ...
We demonstrate that the learning performance of our method is more accurately captured by the structure of the covariance matrix of the noise rather than by the variance of gradients. ...
argmin_θ L(θ) here is the empirical risk minimizer. ...
arXiv:1902.08234v4
fatcat:656pntkmmnhcldlmtgzxktoniy
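One common way to realize noise whose covariance follows the gradient-noise structure is to add the scaled difference of two independent small-batch gradients to a large-batch update. The sketch below assumes that construction on a toy least-squares problem and is not necessarily the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(9)

# Large-batch gradient plus zero-mean noise whose covariance (approximately)
# matches the small-batch gradient noise, rather than isotropic Gaussian noise.
n, d = 2000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)
theta = np.zeros(d)

def batch_grad(idx):
    return X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

large, small, step = 1024, 32, 0.05
for t in range(500):
    g_large = batch_grad(rng.choice(n, size=large, replace=False))
    # Difference of two independent small-batch gradients: zero mean,
    # covariance close to that of small-batch SGD noise.
    g_a = batch_grad(rng.choice(n, size=small, replace=False))
    g_b = batch_grad(rng.choice(n, size=small, replace=False))
    noise = (g_a - g_b) / np.sqrt(2.0)
    theta -= step * (g_large + noise)

print("error:", np.linalg.norm(theta - theta_true))
```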
Variance Reduction with Sparse Gradients
[article]
2020
arXiv
pre-print
With this operator, large batch gradients offer an extra benefit beyond variance reduction: A reliable estimate of gradient sparsity. ...
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. ...
The results of our experiments on natural language processing and matrix factorization demonstrate that, with additional effort, variance reduction methods are competitive with SGD. ...
arXiv:2001.09623v1
fatcat:gnkljc4ur5d75ecfialyggwtja
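A toy sketch of the idea in this snippet: an occasional large-batch gradient serves both as a variance-reduction anchor and as an estimate of which coordinates matter, and subsequent small-batch corrections are restricted to that estimated support. The masking rule, batch sizes, and problem are illustrative, not the paper's sparsity operator.

```python
import numpy as np

rng = np.random.default_rng(10)

# SVRG-style updates where the large-batch gradient also supplies a sparse
# support estimate; small-batch corrections touch only that support.
n, d, k = 1000, 50, 10
X = rng.normal(size=(n, d))
theta_true = np.zeros(d)
theta_true[:k] = rng.normal(size=k)                 # sparse ground truth
y = X @ theta_true + 0.05 * rng.normal(size=n)

def batch_grad(theta, idx):
    return X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

theta, step = np.zeros(d), 0.05
for epoch in range(30):
    theta_ref = theta.copy()
    mu = batch_grad(theta_ref, np.arange(n))        # large-batch gradient
    mask = np.zeros(d)
    mask[np.argsort(np.abs(mu))[-2 * k:]] = 1.0     # estimated sparse support
    for t in range(50):
        idx = rng.choice(n, size=16, replace=False)
        v = mu + mask * (batch_grad(theta, idx) - batch_grad(theta_ref, idx))
        theta -= step * v

print("error:", np.linalg.norm(theta - theta_true))
```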
Showing results 1 — 15 out of 1,127 results