
SGD with Variance Reduction beyond Empirical Risk Minimization [article]

Massil Achab, Stéphane Gaïffas
2016 arXiv   pre-print
The proposed algorithm is doubly stochastic in the sense that gradient steps are done using stochastic gradient descent (SGD) with variance reduction, where the inner expectations are approximated by a  ...  Conclusion We have proposed a doubly stochastic gradient algorithm to extend SGD-like algorithms beyond the empirical risk minimization setting.  ...  This makes this setting quite different from the usual case of empirical risk minimization (linear regression, logistic regression, etc.), where all the gradients ∇f_i share the same low numerical cost  ... 
arXiv:1510.04822v3 fatcat:bgktozbtjjfxbezoqkfcyq52zu

Hedging foreign exchange rate risk: Multi-currency diversification

Susana Álvarez-Díez, Eva Alfaro-Cid, Matilde O. Fernández-Blanco
2016 European Journal of Management and Business Economics  
The most widely used optimal hedge ratio is the so-called minimum-variance (MV) hedge ratio. This is a single objective problem where the risk, measured with the variance, is minimized.  ...  In the long scenario, the total reduction in VaR and CVaR ranges from 1% for SGD to 39.3% for AUD and from 3% for SGD to 38.4% for AUD with an average reduction of 26.4% and 23.9%.  ...  #h > 0: total number of currencies in the hedging portfolio with a long position; #h < 0: total number of currencies in the hedging portfolio with a short position; Total: risk reduction for minimum VaR  ... 
doi:10.1016/j.redee.2015.11.003 fatcat:qanefbwbcbbh3ktchpuvx5tmci

Reducing Runtime by Recycling Samples [article]

Jialei Wang, Hai Wang, Nathan Srebro
2016 arXiv   pre-print
Contrary to the situation with stochastic gradient descent, we argue that when using stochastic methods with variance reduction, such as SDCA, SAG or SVRG, as well as their variants, it could be beneficial  ...  We demonstrate this empirically for SDCA, SAG and SVRG, studying the optimal sample size one should use, and also uncover behavior that suggests running SDCA for an integer number of epochs could be wasteful  ...  Preliminaries: SVM-Type Objectives and Stochastic Optimization Consider SVM-type training, where we learn a linear predictor by regularized empirical risk minimization with a convex loss (hinge loss for  ... 
arXiv:1602.02136v1 fatcat:ljvahxcxbfdbfbkzpc7mb2x76q

Accelerating Stochastic Gradient Descent Using Antithetic Sampling [article]

Jingchang Liu, Linli Xu
2018 arXiv   pre-print
But a rather high variance introduced by the stochastic gradient in each step may slow down the convergence.  ...  In this paper, we propose the antithetic sampling strategy to reduce the variance by taking advantage of the internal structure in dataset.  ...  And in recent years, SGD has been widely used to minimize the empirical risk in machine learning community [Shalev-Shwartz et al., 2007; Shamir and Zhang, 2013; Bottou et al., 2016].  ... 
arXiv:1810.03124v1 fatcat:fwneo6dz6vfcra6uxs5y7rdvsy

Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics [article]

Stéphan Clémençon, Aurélien Bellet, Igor Colin
2016 arXiv   pre-print
data with low variance that take the form of averages over k-tuples.  ...  to as incomplete U-statistics, without damaging the O_P(1/√(n)) learning rate of Empirical Risk Minimization (ERM) procedures.  ...  Bellet was affiliated with Télécom ParisTech.  ... 
arXiv:1501.02629v4 fatcat:22kpgfipvjdfpcxx6mzuycqfdm

Scaling-up Distributed Processing of Data Streams for Machine Learning [article]

Matthew Nokleby, Haroon Raja, Waheed U. Bajwa
2020 arXiv   pre-print
For such methods, the paper discusses recent advances in terms of distributed algorithmic designs when faced with high-rate streaming data.  ...  In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that  ...  (SRM) and empirical risk minimization (ERM).  ... 
arXiv:2005.08854v2 fatcat:y6fvajvq2naajeqs6lo3trrgwy

On the Benefits of Invariance in Neural Networks [article]

Clare Lyle, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, Benjamin Bloem-Reddy
2020 arXiv   pre-print
We prove that training with data augmentation leads to better estimates of risk and gradients thereof, and we provide a PAC-Bayes generalization bound for models trained with data augmentation.  ...  We provide empirical support of these theoretical results, including a demonstration of why generalization may not improve by training with data augmentation: the 'learned invariance' fails outside of  ...  Combined with (21), the reduction in empirical augmented risk follows. The reduction in R • (Q, D n ) follows trivially.  ... 
arXiv:2005.00178v1 fatcat:45lmcynbjnertgapp6x2ok2yu4

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning [article]

Siyuan Ma, Raef Bassily, Mikhail Belkin
2018 arXiv   pre-print
Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.  ...  (b) SGD iteration with mini-batch m> m^* is nearly equivalent to a full gradient descent iteration (saturation regime).  ...  The table on the right compares the convergence of SGD in the interpolation setting with several popular variance reduction methods.  ... 
arXiv:1712.06559v3 fatcat:cgm2ieqksfa3bcza3zb3fgn52e

Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD [article]

Remi Bardenet, Subhro Ghosh, Meixia Lin
2021 arXiv   pre-print
When the number N of data items is large, SGD relies on constructing an unbiased estimator of the gradient of the empirical risk using a small subset of the original dataset, called a minibatch.  ...  Default minibatch construction involves uniformly sampling a subset of the desired size, but alternatives have been explored for variance reduction.  ...  This indicates that there is variance reduction beyond the change of the rate.  ... 
arXiv:2112.06007v1 fatcat:7zr3zigelzgzla6pzeqtkwjknu

Early Stopping without a Validation Set [article]

Maren Mahsereci, Lukas Balles, Christoph Lassner, Philipp Hennig
2017 arXiv   pre-print
It merely ensures that we do not minimize the empirical risk L D of a given model beyond the point of best generalization.  ...  Often there is easy access to the gradient of and gradient-based optimizers can be used to minimize the empirical risk.  ...  -Supplements- 5 Comparison to RMSPROP This Section explores the differences and similarities of SGD+EB-criterion and RMSPROP.  ... 
arXiv:1703.09580v3 fatcat:vpfieu2lcrdkla3s6vybmkgdve

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification [article]

Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford
2018 arXiv   pre-print
for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate.  ...  These results are then utilized in providing a highly parallelizable SGD method that obtains the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly  ...  the standard Empirical Risk Minimizer (or, Maximum Likelihood Estimator) (Lehmann and Casella, 1998; van der Vaart, 2000).  ... 
arXiv:1610.03774v4 fatcat:7gzhgqawanbpndi4ztzipfktni

An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning [article]

Blake Woodworth, Nathan Srebro
2021 arXiv   pre-print
The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously.  ...  Cotter et al. (2011), which has suboptimal dependence on the minibatch size; and over the algorithm of Liu and Belkin (2018), which is limited to least squares problems and is also similarly suboptimal with  ...  Acknowledgements We thank Ohad Shamir for several helpful discussions in the process of preparing this article, and also George Lan for a conversation about optimization with bounded σ * .  ... 
arXiv:2106.02720v2 fatcat:23jfzoqpdrcmxfqttwtunx5bi4

Training Efficiency and Robustness in Deep Learning [article]

Fartash Faghri
2021 arXiv   pre-print
We formalize a simple trick called hard negative mining as a modification to the learning objective function with no computational overhead.  ...  Finally, we study adversarial robustness in deep learning and approaches to achieve maximal adversarial robustness without training with additional data.  ...  ., empirical risk minimization on adversarial samples.  ... 
arXiv:2112.01423v1 fatcat:3yqco7htnjdbng4hx2ilkrnkaq

An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise [article]

Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, Jimmy Ba
2020 arXiv   pre-print
Our empirical studies with standard deep learning model-architectures and datasets show that our method not only improves generalization performance in large-batch training, but furthermore, does so in  ...  We demonstrate that the learning performance of our method is more accurately captured by the structure of the covariance matrix of the noise rather than by the variance of gradients.  ...  θ L(θ) here is the empirical risk minimizer.  ... 
arXiv:1902.08234v4 fatcat:656pntkmmnhcldlmtgzxktoniy

Variance Reduction with Sparse Gradients [article]

Melih Elibol, Lihua Lei, Michael I. Jordan
2020 arXiv   pre-print
With this operator, large batch gradients offer an extra benefit beyond variance reduction: A reliable estimate of gradient sparsity.  ...  Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients.  ...  The results of our experiments on natural language processing and matrix factorization demonstrate that, with additional effort, variance reduction methods are competitive with SGD.  ... 
arXiv:2001.09623v1 fatcat:gnkljc4ur5d75ecfialyggwtja