Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling
[article]
2017
arXiv
pre-print
Second, we conduct the convergence analysis for SGD with local shuffling. ...
When using stochastic gradient descent to solve large-scale machine learning problems, a common practice of data processing is to shuffle the training data, partition the data across multiple machines ...
Convergence analysis of distributed SGD with global shuffling: In this section, we will analyze the convergence rate of distributed SGD with global shuffling, for both convex and non-convex cases. ...
arXiv:1709.10432v1
fatcat:ggivzfkeqbfgzdsc7zmxz3cywe
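The snippets above describe the common pipeline of shuffling the full training set once, partitioning it across machines, and then reshuffling locally on each machine every epoch. A minimal sketch of that pipeline, with illustrative names (`partition`, `local_epoch`, `num_workers`) and toy data that are assumptions rather than the paper's implementation:

```python
import random

def partition(data, num_workers, seed=0):
    """Globally shuffle once, then split the data evenly across workers."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)                      # one global shuffle before partitioning
    return [data[i::num_workers] for i in range(num_workers)]

def local_epoch(worker_data, rng):
    """Locally reshuffle a worker's partition at the start of each epoch."""
    rng.shuffle(worker_data)
    for sample in worker_data:             # each worker sweeps only its own shard
        yield sample

# usage: 4 workers, each iterating its shard in a fresh local order every epoch
shards = partition(range(100), num_workers=4)
for epoch in range(2):
    for w, shard in enumerate(shards):
        for sample in local_epoch(shard, random.Random(epoch * 10 + w)):
            pass  # compute a stochastic gradient on `sample` here
```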
Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs
[article]
2020
arXiv
pre-print
In this paper, we provide a novel analysis of adaptive gradient methods under an additional mild assumption, and revise AdaGrad to match a better provable convergence rate. ...
Õ(T^-1/6) compared with existing adaptive gradient methods and random shuffling SGD, respectively. ...
Let ∇F(x, ζ) denote the stochastic gradient of f(x). The finite-sum objective is a special case of Eq. 20 in which f(x) is built from finitely many sampled stochastic variables ζ. ...
arXiv:2006.07037v1
fatcat:7euj3etwobaa3dairffzpdct2a
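As a reading aid for the notation in the snippet above, the standard way to write the stochastic objective and its finite-sum special case (the paper's Eq. 20 and exact constants may differ; this is only the usual formulation):

```latex
% Stochastic objective and its finite-sum special case (standard formulation;
% the paper's Eq. 20 and exact notation may differ).
\begin{align}
  \min_{x \in \mathbb{R}^d} \; f(x) &= \mathbb{E}_{\zeta}\!\left[ F(x,\zeta) \right], \\
  \text{finite-sum case:}\quad f(x) &= \frac{1}{n} \sum_{i=1}^{n} F(x,\zeta_i),
  \qquad \nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla F(x,\zeta_i).
\end{align}
```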
Stochastic Gradient Descent Tricks
[chapter]
2012
Lecture Notes in Computer Science
Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). ...
The Convergence of Stochastic Gradient Descent The convergence of stochastic gradient descent has been studied extensively in the stochastic approximation literature. ...
The convergence speed of stochastic gradient descent is in fact limited by the noisy approximation of the true gradient. ...
doi:10.1007/978-3-642-35289-8_25
fatcat:t6dwe6cw7vfy5cptuxvpp3tuxq
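The convergence remarks above hinge on SGD replacing the full (batch) gradient with a single-example estimate. In the standard notation for an empirical risk E_n(w) = (1/n) Σ_i Q(z_i, w), which may differ slightly from the chapter's, the two updates are:

```latex
% Batch gradient descent vs. stochastic gradient descent on an empirical risk
% E_n(w) = (1/n) \sum_i Q(z_i, w); \gamma_t is the learning rate.
\begin{align}
  \text{GD:}\quad  w_{t+1} &= w_t - \gamma_t \,\frac{1}{n}\sum_{i=1}^{n} \nabla_w Q(z_i, w_t), \\
  \text{SGD:}\quad w_{t+1} &= w_t - \gamma_t \,\nabla_w Q(z_t, w_t),
  \qquad z_t \ \text{drawn at random from the training set}.
\end{align}
```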
In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle
2022
Proceedings of the 2022 International Conference on Management of Data
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. ...
Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. ...
INTRODUCTION Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. ...
doi:10.1145/3514221.3526150
fatcat:357ayllgfbdrfn3rdv2u4563oe
A block-random algorithm for learning on distributed, heterogeneous data
[article]
2019
arXiv
pre-print
We present block-random gradient descent, a new algorithm that works on distributed, heterogeneous data without having to pre-shuffle. ...
The randomization of the data prior to processing in batches, which is formally required for the stochastic gradient descent algorithm to effectively derive a useful deep learning model, is expected to be prohibitively ...
Algorithm 1: Stochastic gradient descent with mini-batching. Parameters: learning rate η, batch size n_b, number of epochs n_e. Input: training data with N samples. while i ≤ n_e do: randomly shuffle data ... (a runnable sketch follows this entry)
arXiv:1903.00091v1
fatcat:ittwx4ejhveivfprjmti3xjhlu
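A runnable sketch of the mini-batch SGD loop outlined in Algorithm 1 above; the quadratic toy loss, `grad_fn`, and parameter defaults are illustrative assumptions, not the paper's code:

```python
import numpy as np

def minibatch_sgd(data, grad_fn, params, lr=0.1, batch_size=8, num_epochs=5, seed=0):
    """Mini-batch SGD: reshuffle the data each epoch, then step through batches."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(num_epochs):
        order = rng.permutation(n)                 # randomly shuffle data
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            params = params - lr * grad_fn(params, batch)   # gradient step on the mini-batch
    return params

# toy usage: minimize the mean squared distance to a set of points
points = np.random.default_rng(1).normal(size=(100, 2))
grad = lambda w, batch: np.mean(w - batch, axis=0)          # gradient of 0.5*||w - x||^2 averaged over the batch
w_star = minibatch_sgd(points, grad, params=np.zeros(2))
```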
Gradient Descent based Optimization Algorithms for Deep Learning Models Training
[article]
2019
arXiv
pre-print
In back propagation, the model variables are updated iteratively, using gradient descent based optimization algorithms, until convergence. ...
Besides the conventional vanilla gradient descent algorithm, many gradient descent variants have also been proposed in recent years to improve the learning performance, including Momentum, Adagrad, Adam ...
The learning process of stochastic gradient descent may fluctuate a lot, which also gives stochastic gradient descent the ability to jump out of local optima. ...
arXiv:1903.03614v1
fatcat:hax6xb46hvg5hnhd7ggdobpgve
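The snippet above lists Momentum, Adagrad, and Adam among the variants of vanilla gradient descent; for orientation, the textbook forms of the vanilla and momentum updates (simplified, not the survey's exact pseudocode) are:

```latex
% Vanilla gradient descent vs. the momentum variant (textbook forms);
% \eta is the learning rate, \beta the momentum coefficient.
\begin{align}
  \text{vanilla:}\quad  \theta_{t+1} &= \theta_t - \eta \,\nabla_\theta L(\theta_t), \\
  \text{momentum:}\quad v_{t+1} &= \beta\, v_t + \nabla_\theta L(\theta_t),
  \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}.
\end{align}
```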
The Geometry of Sign Gradient Descent
[article]
2020
arXiv
pre-print
Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the ℓ_∞-norm. ...
We then proceed to study the smoothness constant with respect to the ℓ_∞-norm and thereby isolate geometric properties of the objective function which affect the performance of sign-based methods. ...
Lukas Balles kindly acknowledges the support of the International Max Planck Research School for Intelligent Systems (IMPRS-IS) as well as financial support by the European Research Council through ERC ...
arXiv:2002.08056v1
fatcat:uakvuoahbzh5disayo3x7lwxcm
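For reference, the sign gradient descent step discussed above, i.e. steepest descent with respect to the ℓ_∞-norm, in its standard form (the paper's step-size rule and assumptions may differ):

```latex
% Sign gradient descent: only the sign of each partial derivative is used.
% This is steepest descent w.r.t. the l_infinity norm (up to the step-size rule).
\begin{equation}
  x_{t+1} = x_t - \eta_t \,\operatorname{sign}\!\big(\nabla f(x_t)\big),
  \qquad \operatorname{sign}(g)_i =
  \begin{cases} +1 & g_i > 0, \\ 0 & g_i = 0, \\ -1 & g_i < 0. \end{cases}
\end{equation}
```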
Efficient Distributed Semi-Supervised Learning using Stochastic Regularization over Affinity Graphs
[article]
2018
arXiv
pre-print
We utilize a technique, first described in [13], for the construction of mini-batches for stochastic gradient descent (SGD) based on synthesized partitions of an affinity graph that are consistent with the graph structure, but also preserve enough stochasticity for convergence of SGD to good local minima. ...
The methods presented were heuristically motivated; for our current research we are looking at further analysis, asynchronous versions of SGD and more provably optimal methods for constructing meta-batches ...
arXiv:1612.04898v2
fatcat:nfg5dxkuhreltdpihlpon5e36i
Stochastic Training is Not Necessary for Generalization
[article]
2022
arXiv
pre-print
To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline. ...
In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. ...
The theoretical analysis of gradient clipping for GD in Zhang et al. (2019b) supports these findings, showing that clipped descent algorithms can converge faster than unclipped algorithms ...
arXiv:2109.14119v2
fatcat:izkob2pvcfefhaqospgdzjnr7e
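The snippet above points to gradient clipping as part of making full-batch (non-stochastic) training competitive; a minimal sketch of clip-by-global-norm applied to a full-batch step, with an illustrative threshold and toy loss that are assumptions rather than the paper's recipe:

```python
import numpy as np

def clipped_full_batch_step(params, grad, lr=0.1, clip_norm=1.0):
    """One full-batch gradient descent step with clip-by-global-norm."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:                    # rescale only when the gradient is too large
        grad = grad * (clip_norm / norm)
    return params - lr * grad

# toy usage: full-batch gradient of 0.5*||w - mean(X)||^2 over the whole dataset
X = np.random.default_rng(0).normal(size=(1000, 3))
w = np.zeros(3)
for _ in range(100):
    full_grad = w - X.mean(axis=0)          # full-batch gradient (no mini-batch noise)
    w = clipped_full_batch_step(w, full_grad)
```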
GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent
[article]
2018
arXiv
pre-print
their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners for facilitating direct diffusion of gradients, 4) asynchronous distributed shuffle of samples during ...
In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. ...
An important type of Gradient Descent is Batch/Stochastic Gradient Descent (SGD), where a random subset of samples is used for the iterative feed-forward (calculation of predicted values) and back-propagation ...
arXiv:1803.05880v1
fatcat:tun5qumqbvbjhay4q2dwbzdyxi
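The snippet above mentions rotating communication partners so that gradients diffuse across all p workers; one illustrative way to schedule such a rotation is a hypercube-style partner pattern, sketched below. This is an assumption for illustration, not necessarily GossipGraD's exact schedule:

```python
import math

def gossip_partner(rank, step, num_workers):
    """Hypercube-style partner rotation: over log2(p) consecutive steps each
    worker's updates can reach every other worker indirectly. Illustrative
    scheme only, not necessarily the exact schedule used by GossipGraD."""
    d = int(math.log2(num_workers))
    return rank ^ (1 << (step % d))         # flip a different bit of the rank each step

# usage: with 8 workers, worker 3 exchanges with 2, 1, 7, then repeats
print([gossip_partner(3, t, 8) for t in range(6)])   # [2, 1, 7, 2, 1, 7]
```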
Distributed Random Reshuffling over Networks
[article]
2022
arXiv
pre-print
To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that combines the classical distributed gradient descent (DGD) method and Random Reshuffling (RR). ...
These convergence results match those of centralized RR (up to constant factors). ...
The distributed implementation of stochastic gradient methods over networks, including using vanilla distributed stochastic gradient descent (DSGD) and more advanced methods, has been shown to achieve ...
arXiv:2112.15287v2
fatcat:ue65yuss7fcxthuvyph7qyqpse
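A condensed sketch of the combination described above: each node sweeps its own data in a fresh random order every epoch (random reshuffling) while mixing its iterate with neighbors as in distributed gradient descent. The mixing matrix `W`, the mix-then-step ordering, and the equal-sized local datasets are illustrative assumptions, not the paper's exact D-RR recursion:

```python
import numpy as np

def d_rr_epoch(x, local_data, W, grad_fn, lr, rng):
    """One epoch of a DGD + random-reshuffling style update over all nodes.
    x: (num_nodes, dim) iterates; W: doubly-stochastic mixing matrix;
    local_data[i]: the samples held by node i (assumed equal length here)."""
    num_nodes = x.shape[0]
    orders = [rng.permutation(len(local_data[i])) for i in range(num_nodes)]
    for t in range(len(orders[0])):
        x = W @ x                                      # consensus: mix iterates with neighbors
        for i in range(num_nodes):
            sample = local_data[i][orders[i][t]]       # reshuffled pass over node i's own data
            x[i] = x[i] - lr * grad_fn(x[i], sample)   # local gradient step on that sample
    return x
```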
Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
[article]
2015
arXiv
pre-print
In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in terms of convergence and accuracy. ...
Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent (ASGD) provides faster (or at least equal) convergence, close to linear scaling and stable accuracy. ...
This changed with the presentation of a generic ...
Figure 1: Evaluation of the scaling properties of different parallel gradient descent algorithms for machine learning applications on distributed memory ...
arXiv:1505.04956v5
fatcat:rhrjhdl6jvg5xdn2wcyjqxwsva
Differentially Private Learning Needs Hidden State (Or Much Faster Convergence)
[article]
2022
arXiv
pre-print
In this paper, we extend this hidden-state analysis to the noisy mini-batch stochastic gradient descent algorithms on strongly-convex smooth loss functions. ...
Our converging privacy analysis thus shows that differentially private learning, with a tight bound, needs a hidden-state privacy analysis or fast convergence. ...
Moreover, our bound significantly improves over the previous converging privacy dynamics bound for noisy gradient descent [14] combined with the naive analysis of the post-processing property of Rényi divergence ...
arXiv:2203.05363v1
fatcat:oxbi5a7d6japfjkl5l7cpgsgkm
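The noisy mini-batch SGD analyzed above is commonly instantiated by clipping per-example gradients and adding Gaussian noise; a compact sketch of one such step in the generic DP-SGD form (the clipping threshold C, noise scale sigma, and learning rate are illustrative parameters, not the paper's settings):

```python
import numpy as np

def noisy_minibatch_step(params, per_example_grads, lr=0.05, C=1.0, sigma=1.0, rng=None):
    """One noisy mini-batch SGD step: clip each per-example gradient to norm C,
    sum, add Gaussian noise scaled by sigma*C, and average (generic DP-SGD form)."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in per_example_grads]
    noise = rng.normal(scale=sigma * C, size=params.shape)
    noisy_grad = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_grad
```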
Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
[article]
2015
arXiv
pre-print
Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms. ...
Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce based, parallel algorithms solving large scale machine learning problems. ...
This is usually done by gradient descent over the partial derivatives of the loss function in the parameter space of w. Stochastic Gradient Descent. ...
arXiv:1510.01155v1
fatcat:y3eu66h5ircmbk7s3lujszeiwi
Optimizing Machine Learning on Apache Spark in HPC Environments
2018
2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based allreduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent ...
We also demonstrate a comparable convergence rate using the new asynchronous SGD with respect to the synchronous method. ...
A comparable convergence rate for the new asynchronous stochastic gradient descent algorithm with respect to the synchronous method, and faster convergence with a larger batch size; (V) an estimated 2x ...
doi:10.1109/mlhpc.2018.8638643
fatcat:r2wutc5fzvbpzg6zrqoqe5erna
Showing results 1 — 15 out of 2,951 results