2,951 Hits in 5.2 sec

Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling [article]

Qi Meng, Wei Chen, Yue Wang, Zhi-Ming Ma, Tie-Yan Liu
2017 arXiv   pre-print
Second, we conduct the convergence analysis for SGD with local shuffling.  ...  When using stochastic gradient descent to solve large-scale machine learning problems, a common practice is to shuffle the training data and partition the data across multiple machines  ...  Convergence analysis of distributed SGD with global shuffling: In this section, we analyze the convergence rate of distributed SGD with global shuffling, for both the convex and non-convex cases.  ... 
arXiv:1709.10432v1 fatcat:ggivzfkeqbfgzdsc7zmxz3cywe
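
A minimal sketch of the global-shuffling scheme described in this abstract, not the authors' code: the function name, the least-squares loss, the sequential simulation of the machines, and the per-epoch model averaging (standing in for whatever aggregation the paper actually uses) are all assumptions. Each epoch, the full index set is shuffled once, partitioned across machines, and each machine sweeps its partition with SGD.

```python
import numpy as np

def sgd_epoch_with_global_shuffle(X, y, w, num_machines, lr=0.01, seed=0):
    """One (simulated) epoch of distributed SGD with global shuffling:
    shuffle the whole dataset once, partition it across machines, let each
    machine sweep its partition with SGD on a least-squares loss starting
    from the shared model, then average the local models."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))                  # global shuffle
    parts = np.array_split(perm, num_machines)      # partition across machines
    local_models = []
    for part in parts:                              # one local sweep per machine
        w_local = w.copy()
        for i in part:
            grad = (X[i] @ w_local - y[i]) * X[i]   # grad of 0.5*(x.w - y)^2
            w_local -= lr * grad
        local_models.append(w_local)
    return np.mean(local_models, axis=0)            # average the local models

# toy usage
X = np.random.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
w = np.zeros(5)
for epoch in range(20):
    w = sgd_epoch_with_global_shuffle(X, y, w, num_machines=4, seed=epoch)
```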

Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs [article]

Xunpeng Huang, Hao Zhou, Runxin Xu, Zhe Wang, Lei Li
2020 arXiv   pre-print
In this paper, we provide a novel analysis of adaptive gradient methods under an additional mild assumption, and revise AdaGrad to match a better provable convergence rate.  ...  Õ(T^{-1/6}) compared with existing adaptive gradient methods and random-shuffling SGD, respectively.  ...  Let ∇F(x, ζ) denote the stochastic gradient of f(x). The finite-sum objective is a special case of Eq. 20, in which f(x) is defined over finitely many sampled stochastic variables ζ.  ... 
arXiv:2006.07037v1 fatcat:7euj3etwobaa3dairffzpdct2a
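
For reference, the standard diagonal AdaGrad update that this paper revises looks roughly as follows; this is a generic sketch, not the paper's revised method, and the function name and toy quadratic are our own.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad: each coordinate's effective step size shrinks
    with its accumulated squared gradients."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself
w, accum = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, accum = adagrad_step(w, grad=w, accum=accum, lr=0.5)
```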

Stochastic Gradient Descent Tricks [chapter]

Léon Bottou
2012 Lecture Notes in Computer Science  
Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD).  ...  The Convergence of Stochastic Gradient Descent: The convergence of stochastic gradient descent has been studied extensively in the stochastic approximation literature.  ...  The convergence speed of stochastic gradient descent is in fact limited by the noisy approximation of the true gradient.  ... 
doi:10.1007/978-3-642-35289-8_25 fatcat:t6dwe6cw7vfy5cptuxvpp3tuxq
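
The basic recipe discussed in this chapter, replacing the full gradient by the gradient of a single randomly chosen example with a decaying step size, can be sketched as follows. This is a generic illustration with an assumed decay schedule and loss, not Bottou's exact recommendations.

```python
import numpy as np

def sgd(grad_fn, w0, examples, lr0=0.1, epochs=5, seed=0):
    """Plain SGD: at each step, use the gradient of one randomly chosen
    example as a noisy estimate of the full gradient, with a decaying
    step size lr0 / (1 + 0.01 * t)."""
    rng = np.random.default_rng(seed)
    w, t = np.array(w0, dtype=float), 0
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            w -= lr0 / (1.0 + 0.01 * t) * grad_fn(w, examples[i])
            t += 1
    return w

# toy usage: least-squares on a few (x, y) pairs
examples = [(np.array([1.0, 2.0]), 3.0), (np.array([0.5, -1.0]), -1.5)]
grad_fn = lambda w, ex: (ex[0] @ w - ex[1]) * ex[0]
w = sgd(grad_fn, w0=[0.0, 0.0], examples=examples, epochs=50)
```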

In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang
2022 Proceedings of the 2022 International Conference on Management of Data  
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems.  ...  Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that obtained as if a full shuffle were performed.  ... 
doi:10.1145/3514221.3526150 fatcat:357ayllgfbdrfn3rdv2u4563oe
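
The abstract's idea of avoiding a full shuffle can be illustrated with a two-level (block-then-tuple) shuffle sketch. The function name, the block/buffer parameters, and the in-memory index handling below are assumptions, not CorgiPile's in-database implementation.

```python
import numpy as np

def corgipile_style_order(n_examples, block_size, buffer_blocks, seed=0):
    """Two-level shuffle in the spirit of CorgiPile: shuffle the order of
    blocks, then repeatedly fill a small buffer with a few blocks and
    shuffle the tuples inside that buffer, instead of shuffling the full
    dataset at once."""
    rng = np.random.default_rng(seed)
    blocks = [np.arange(i, min(i + block_size, n_examples))
              for i in range(0, n_examples, block_size)]
    rng.shuffle(blocks)                              # block-level shuffle
    order = []
    for start in range(0, len(blocks), buffer_blocks):
        buffer = np.concatenate(blocks[start:start + buffer_blocks])
        rng.shuffle(buffer)                          # tuple-level shuffle
        order.extend(buffer.tolist())
    return order

# toy usage: an access order over 10,000 examples without a full shuffle
order = corgipile_style_order(n_examples=10_000, block_size=500, buffer_blocks=4)
```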

A block-random algorithm for learning on distributed, heterogeneous data [article]

Prakash Mohan, Marc T. Henry de Frahan, Ryan King, Ray W. Grout
2019 arXiv   pre-print
We present block-random gradient descent, a new algorithm that works on distributed, heterogeneous data without having to pre-shuffle.  ...  The randomization of the data prior to processing in batches, which is formally required for the stochastic gradient descent algorithm to effectively derive a useful deep learning model, is expected to be prohibitively  ...  Algorithm 1 (stochastic gradient descent with mini-batching). Parameters: learning rate η, batch size n_b, number of epochs n_e. Input: training data with N samples. While i ≤ n_e: randomly shuffle the data  ... 
arXiv:1903.00091v1 fatcat:ittwx4ejhveivfprjmti3xjhlu
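
A runnable version of the quoted mini-batch loop (a generic sketch with a least-squares loss, not the paper's block-random variant) might look like:

```python
import numpy as np

def minibatch_sgd(X, y, w, lr=0.05, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD as in the quoted Algorithm 1: shuffle the data each
    epoch, then update on consecutive mini-batches."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                    # randomly shuffle data
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)    # mini-batch gradient
            w = w - lr * grad
    return w

# toy usage
X = np.random.randn(500, 3)
y = X @ np.array([2.0, -1.0, 0.5])
w = minibatch_sgd(X, y, w=np.zeros(3))
```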

Gradient Descent based Optimization Algorithms for Deep Learning Models Training [article]

Jiawei Zhang
2019 arXiv   pre-print
In backpropagation, the model variables are updated iteratively until convergence using gradient-descent-based optimization algorithms.  ...  Besides the conventional vanilla gradient descent algorithm, many gradient descent variants have been proposed in recent years to improve learning performance, including Momentum, Adagrad, Adam  ...  The learning process of stochastic gradient descent may fluctuate considerably, which also gives stochastic gradient descent the ability to jump out of local optima.  ... 
arXiv:1903.03614v1 fatcat:hax6xb46hvg5hnhd7ggdobpgve
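
As one example of the variants listed above, the classical momentum (heavy-ball) update can be sketched as follows; this is a generic illustration, with names and constants our own rather than taken from the survey.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Heavy-ball momentum: accumulate a decaying sum of past gradients
    (the 'velocity') and move along it instead of the raw gradient."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# toy usage on f(w) = 0.5 * ||w||^2 (gradient is w)
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, grad=w, velocity=v)
```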

The Geometry of Sign Gradient Descent [article]

Lukas Balles and Fabian Pedregosa and Nicolas Le Roux
2020 arXiv   pre-print
Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the ℓ_∞-norm.  ...  We then proceed to study the smoothness constant with respect to the ℓ_∞-norm and thereby isolate geometric properties of the objective function which affect the performance of sign-based methods.  ... 
arXiv:2002.08056v1 fatcat:uakvuoahbzh5disayo3x7lwxcm
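
The sign-based update studied here is simple to state; a minimal sketch (our own naming, not the authors' code) is:

```python
import numpy as np

def sign_gd_step(w, grad, lr=0.01):
    """Sign gradient descent: discard gradient magnitudes and step along
    the per-coordinate signs, i.e. steepest descent in the l_inf geometry."""
    return w - lr * np.sign(grad)

# toy usage on f(w) = 0.5 * ||w||^2 (gradient is w); with a fixed step the
# iterates settle into a band of width ~lr around the minimizer
w = np.array([3.0, -4.0, 0.5])
for _ in range(300):
    w = sign_gd_step(w, grad=w, lr=0.01)
```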

Efficient Distributed Semi-Supervised Learning using Stochastic Regularization over Affinity Graphs [article]

Sunil Thulasidasan, Jeffrey Bilmes, Garrett Kenyon
2018 arXiv   pre-print
We utilize a technique, first described in [13], for the construction of mini-batches for stochastic gradient descent (SGD) based on synthesized partitions of an affinity graph that are consistent with the graph structure, but also preserve enough stochasticity for convergence of SGD to good local minima.  ...  The methods presented were heuristically motivated; for our current research we are looking at further analysis, asynchronous versions of SGD, and more provably optimal methods for constructing meta-batches  ... 
arXiv:1612.04898v2 fatcat:nfg5dxkuhreltdpihlpon5e36i
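
The construction of [13] is not spelled out in this snippet. Purely as an illustration of mini-batches that are consistent with an affinity graph yet still random, one could grow each batch by breadth-first expansion from a random seed node, as in the hypothetical sketch below; the function name and the BFS strategy are our assumptions, not the paper's method.

```python
import random
from collections import deque

def graph_consistent_batches(adjacency, batch_size, seed=0):
    """Illustrative only: grow each mini-batch by breadth-first expansion
    from a random unused seed node, so batches follow the affinity-graph
    structure while the random seed choice keeps some stochasticity."""
    rng = random.Random(seed)
    unused = set(adjacency)
    batches = []
    while unused:
        seed_node = rng.choice(sorted(unused))
        batch, frontier = [], deque([seed_node])
        while frontier and len(batch) < batch_size:
            node = frontier.popleft()
            if node in unused:
                unused.remove(node)
                batch.append(node)
                frontier.extend(n for n in adjacency[node] if n in unused)
        batches.append(batch)
    return batches

# toy usage on a 4-node affinity graph
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
batches = graph_consistent_batches(adjacency, batch_size=2)
```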

Stochastic Training is Not Necessary for Generalization [article]

Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
2022 arXiv   pre-print
To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization, even when comparing against a strong and well-researched baseline.  ...  In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures.  ...  The theoretical analysis of gradient clipping for GD in Zhang et al. (2019b) supports these findings: it is shown there that clipped descent algorithms can converge faster than unclipped algorithms  ... 
arXiv:2109.14119v2 fatcat:izkob2pvcfefhaqospgdzjnr7e
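
The clipped descent step referenced in the last excerpt can be sketched as follows; this is a generic global-norm clipping rule, with the threshold and naming being assumptions rather than the paper's exact setup.

```python
import numpy as np

def clipped_gd_step(w, grad, lr=0.1, clip_norm=1.0):
    """Full-batch gradient descent with global-norm clipping: rescale the
    gradient whenever its Euclidean norm exceeds clip_norm, then step."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    return w - lr * grad

# toy usage on f(w) = 0.5 * ||w||^2 (gradient is w)
w = np.array([10.0, -10.0])
for _ in range(100):
    w = clipped_gd_step(w, grad=w, lr=0.1, clip_norm=1.0)
```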

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent [article]

Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya
2018 arXiv   pre-print
... their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners for facilitating direct diffusion of gradients, 4) asynchronous distributed shuffle of samples during  ...  In this paper, we present GossipGraD, a gossip-communication-protocol-based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems.  ...  An important type of gradient descent is batch/stochastic gradient descent (SGD), where a random subset of samples is used for iterative feed-forward (calculation of predicted values) and back-propagation  ... 
arXiv:1803.05880v1 fatcat:tun5qumqbvbjhay4q2dwbzdyxi
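
A much-simplified simulation of gossip-style exchange with rotating partners could look like the sketch below; the hypercube-style partner schedule, the averaging of full parameter vectors, and all names are our assumptions, not GossipGraD's MPI implementation.

```python
import numpy as np

def gossip_round(params, step):
    """Simulated gossip exchange with rotating partners: at step t, worker
    i pairs with worker i XOR 2^(t mod log2(p)) and the pair averages its
    parameter vectors, so information diffuses without all-to-all traffic.
    Assumes the number of workers p is a power of two."""
    p = len(params)
    stride = 1 << (step % int(np.log2(p)))
    new_params = [v.copy() for v in params]
    for i in range(p):
        j = i ^ stride
        if i < j:
            avg = 0.5 * (params[i] + params[j])
            new_params[i] = new_params[j] = avg
    return new_params

# toy usage: 4 workers converge toward the average of their parameters
params = [np.random.randn(3) for _ in range(4)]
for t in range(6):
    params = gossip_round(params, step=t)
```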

Distributed Random Reshuffling over Networks [article]

Kun Huang, Xiao Li, Andre Milzarek, Shi Pu, Junwen Qiu
2022 arXiv   pre-print
To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that combines the classical distributed gradient descent (DGD) method and Random Reshuffling (RR).  ...  These convergence results match those of centralized RR (up to constant factors).  ...  The distributed implementation of stochastic gradient methods over networks, including vanilla distributed stochastic gradient descent (DSGD) and more advanced methods, has been shown to achieve  ... 
arXiv:2112.15287v2 fatcat:ue65yuss7fcxthuvyph7qyqpse
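
A simplified, simulated sketch of a D-RR-style epoch, combining consensus mixing over a network with per-agent random reshuffling; equal local sample counts, the least-squares loss, and all names are our assumptions, not the authors' algorithm statement.

```python
import numpy as np

def drr_epoch(X_local, y_local, x, W, lr=0.05, seed=0):
    """One epoch of a D-RR-style update: each agent reshuffles its own
    samples, and every inner step mixes the iterates with its neighbours
    via the mixing matrix W, then takes a gradient step on its next local
    sample. Shapes: x is (n_agents, dim), W is (n_agents, n_agents)."""
    rng = np.random.default_rng(seed)
    n_agents, m = len(X_local), len(X_local[0])
    perms = [rng.permutation(m) for _ in range(n_agents)]   # local reshuffle
    for t in range(m):
        grads = np.stack([
            (X_local[i][perms[i][t]] @ x[i] - y_local[i][perms[i][t]])
            * X_local[i][perms[i][t]]
            for i in range(n_agents)
        ])
        x = W @ x - lr * grads            # consensus mixing + local RR step
    return x

# toy usage: 3 agents with a doubly stochastic mixing matrix
n_agents, m, dim = 3, 20, 5
W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
X_local = [np.random.randn(m, dim) for _ in range(n_agents)]
y_local = [Xi @ np.ones(dim) for Xi in X_local]
x = np.zeros((n_agents, dim))
for epoch in range(30):
    x = drr_epoch(X_local, y_local, x, W, seed=epoch)
```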

Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms [article]

Janis Keuper, Franz-Josef Pfreundt
2015 arXiv   pre-print
In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in terms of convergence and accuracy.  ...  Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent (ASGD) provides faster (or at least equal) convergence, close-to-linear scaling, and stable accuracy.  ...  This changed with the presentation of a generic  ...  Figure 1: Evaluation of the scaling properties of different parallel gradient descent algorithms for machine learning applications on distributed memory  ... 
arXiv:1505.04956v5 fatcat:rhrjhdl6jvg5xdn2wcyjqxwsva

Differentially Private Learning Needs Hidden State (Or Much Faster Convergence) [article]

Jiayuan Ye, Reza Shokri
2022 arXiv   pre-print
In this paper, we extend this hidden-state analysis to noisy mini-batch stochastic gradient descent algorithms on strongly convex, smooth loss functions.  ...  Our converging privacy analysis thus shows that differentially private learning, with a tight bound, needs hidden-state privacy analysis or fast convergence.  ...  Moreover, our bound significantly improves over the previous converging privacy-dynamics bound for noisy gradient descent [14] combined with the naive analysis of the post-processing property of Rényi divergence  ... 
arXiv:2203.05363v1 fatcat:oxbi5a7d6japfjkl5l7cpgsgkm
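
The noisy mini-batch SGD step analysed here follows the standard clip-then-add-Gaussian-noise pattern from the DP literature; the sketch below is a generic DP-SGD-style update with assumed names and constants, not the paper's exact mechanism or noise calibration.

```python
import numpy as np

def noisy_minibatch_step(w, per_example_grads, lr=0.05, clip=1.0, sigma=1.0,
                         rng=None):
    """One step of noisy mini-batch SGD: clip each per-example gradient to
    norm <= clip, average, and add Gaussian noise scaled to the clipping
    threshold before taking the gradient step."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(per_example_grads),
                       size=mean_grad.shape)
    return w - lr * (mean_grad + noise)

# toy usage with stand-in per-example gradients
rng = np.random.default_rng(0)
w = np.zeros(4)
grads = [rng.normal(size=4) for _ in range(32)]
w = noisy_minibatch_step(w, grads, rng=rng)
```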

Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms [article]

Janis Keuper, Franz-Josef Pfreundt
2015 arXiv   pre-print
Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms.  ...  Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce-based, parallel algorithms for solving large-scale machine learning problems.  ...  This is usually done by gradient descent over the partial derivatives of the loss function in the parameter space of w. Stochastic Gradient Descent:  ... 
arXiv:1510.01155v1 fatcat:y3eu66h5ircmbk7s3lujszeiwi

Optimizing Machine Learning on Apache Spark in HPC Environments

Zhenyu Li, James Davis, Stephen A. Jarvis
2018 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)  
To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based allreduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent  ...  We also demonstrate a comparable convergence rate using the new asynchronous SGD with respect to the synchronous method.  ...  A comparable convergence rate with the new asynchronous stochastic gradient descent algorithm with respect to the synchronous method, and faster convergence with a larger batch size; (V) an estimated 2x  ... 
doi:10.1109/mlhpc.2018.8638643 fatcat:r2wutc5fzvbpzg6zrqoqe5erna
Showing results 1 — 15 out of 2,951 results