
Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays [article]

Konstantin Mishchenko, Francis Bach, Mathieu Even, Blake Woodworth
2022 arXiv   pre-print
Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider.  ...  The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay.  ...  Some algorithms try to adapt to the delays, but even these are not proven to perform well under arbitrary delays [34, 52] .  ... 
arXiv:2206.07638v1 fatcat:efursasonjcm5oe2jum2ptbkza
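The setting this paper analyzes, where updates are applied using gradients computed at stale parameters, can be illustrated with a minimal single-process simulation. This is not the authors' algorithm or analysis, just a sketch of delayed-gradient SGD on the toy objective f(x) = x²/2; the function name, step size, and delay bound are all illustrative.

```python
import random

def async_sgd_simulation(x0=5.0, lr=0.1, steps=200, max_delay=4, seed=0):
    """Simulate asynchronous SGD on f(x) = x^2 / 2 (gradient: x).

    Each update applies a gradient computed at a stale iterate,
    delayed by up to `max_delay` steps, mimicking workers that
    read parameters, compute a gradient, and write back later.
    """
    rng = random.Random(seed)
    history = [x0]  # past iterates that workers may have read
    x = x0
    for t in range(steps):
        delay = rng.randint(0, min(max_delay, t))  # arbitrary bounded delay
        stale_x = history[-1 - delay]              # parameters read `delay` steps ago
        x = x - lr * stale_x                       # server applies the stale gradient
        history.append(x)
    return x
```

With a small enough step size, the iterates still contract toward the optimum despite the staleness, which is the behavior the paper's guarantees quantify.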

On the Convergence Analysis of Asynchronous SGD for Solving Consistent Linear Systems [article]

Atal Narayan Sahu and Aritra Dutta and Aashutosh Tiwari and Peter Richtárik
2020 arXiv   pre-print
We compare the convergence rates of our asynchronous SGD algorithm with the synchronous parallel algorithm proposed by Richtárik and Takáč in [35] under different choices of the hyperparameters—the stepsize  ...  In this paper, we propose and analyze a distributed, asynchronous parallel SGD in light of solving an arbitrary consistent linear system by reformulating the system into a stochastic optimization problem  ...  [5] showed that for convex problems, under similar conditions as regular SGD, asynchronous SGD achieves similar asymptotic convergence rate.  ... 
arXiv:2004.02163v1 fatcat:5rdknx4hwjbmdox5ionzunydbi

Asynchronous Stochastic Gradient Descent with Delay Compensation [article]

Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, Tie-Yan Liu
2020 arXiv   pre-print
We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches  ...  Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted to fulfill this task for its efficiency, which is, however, known to suffer from the problem of delayed gradients.  ...  algorithm called Delay Compensated Asynchronous SGD (DC-ASGD) to tackle the problem.  ... 
arXiv:1609.08326v6 fatcat:emhoqw6e4vhcjoc3i5gpoa3cfi
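The delay compensation the abstract refers to can be sketched as a single parameter-server update, following the correction term described in the DC-ASGD paper: the stale gradient is adjusted with a cheap diagonal approximation of a Hessian-vector product. The function name and default constants below are illustrative, not the authors' implementation.

```python
import numpy as np

def dc_asgd_update(w_now, w_backup, grad_backup, lr=0.1, lam=0.04):
    """One delay-compensated parameter-server update (DC-ASGD sketch).

    A worker computed `grad_backup` at the stale parameters `w_backup`;
    by the time it arrives, the server already holds `w_now`. The stale
    gradient is corrected with the elementwise term
    g * g * (w_now - w_backup), a diagonal Hessian approximation.
    """
    compensated = grad_backup + lam * grad_backup * grad_backup * (w_now - w_backup)
    return w_now - lr * compensated
```

Setting `lam=0` recovers plain asynchronous SGD, which is the baseline the compensated variant is compared against.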

The Minimax Complexity of Distributed Optimization [article]

Blake Woodworth
2021 arXiv   pre-print
I provide the first guarantees for Local SGD that improve over simple baseline methods, but show that Local SGD is not optimal in general.  ...  Next, I describe a general approach to proving optimization lower bounds for arbitrary randomized algorithms (as opposed to more restricted classes of algorithms, e.g., deterministic or "zero-respecting  ...  Therefore, the delay graph might correspond to an asynchronous setting in which the algorithm issues queries to an oracle, but does not receive a response for τ time steps.  ... 
arXiv:2109.00534v1 fatcat:ibkwtyfd3bawzftakx7ebpwod4

High-Performance Distributed ML at Scale through Parameter Server Consistency Models [article]

Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, Eric P. Xing
2014 arXiv   pre-print
Then Theorem 5 (SGD under SSP, convergence in probability): Given convex function f. Theorem 6 (SGD under SSP, decreasing variance): Given the setup in Theorem 5 and Assumptions 1-3.  ...  (SGD under VAP, convergence in expectation): Given convex function f(x) = Σ_{t=1}^{T} f_t(x) such that the components f_t are also convex.  ...  Appendix Theorem 1 (SGD under VAP, convergence in expectation): Given convex function f(x) = Σ_{t=1}^{T} f_t(x) such that the components f_t are also convex.  ... 
arXiv:1410.8043v1 fatcat:gymqox7auzewxezrn7e64vdcsa

Deep Learning at Scale with Nearest Neighbours Communications

Paolo Viviani, Marco Aldinucci
2019 Zenodo  
Moreover, in order to validate the proposed strategy, the Flexible Asynchronous Scalable Training (FAST) framework is introduced, which allows the nearest-neighbours communications approach to be applied to  ...  The key feature of this point is that not all the gradients reach all the workers, not even after an arbitrary delay.  ...  SGD, a.k.a. the large mini-batch approach. FIGURE 3.5: Asynchronous SGD with parameter server.  ... 
doi:10.5281/zenodo.3516093 fatcat:5uk24pxanjhunlqb26r7j6aarm

Linearly Converging Error Compensated SGD [article]

Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, Peter Richtárik
2020 arXiv   pre-print
In this paper, we propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates.  ...  Our framework is general enough to cover different variants of quantized SGD, Error-Compensated SGD (EC-SGD) and SGD with delayed updates (D-SGD).  ...  H SGD with Delayed Updates In this section we consider the SGD with delayed updates (D-SGD) [1, 33, 10, 3, 45] .  ... 
arXiv:2010.12292v1 fatcat:uw7cgz7ysbctfbwr6wmbi4k66i
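The error-compensation mechanism that EC-SGD builds on can be sketched in a few lines: the residual left behind by a lossy gradient compressor is stored locally and added back before the next compression, so no gradient mass is permanently lost. This is a generic error-feedback sketch with a top-k compressor, not the paper's unified framework; the function names and defaults are illustrative.

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries, zero the rest (a common compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ec_sgd_step(w, grad, error, lr=0.1, k=1):
    """One error-compensated SGD step (EC-SGD sketch).

    `error` holds the residual dropped by earlier compressions; it is
    added back so that dropped coordinates are eventually transmitted.
    """
    corrected = lr * grad + error          # add back previously dropped mass
    update = topk_compress(corrected, k)   # only this compressed part is sent
    new_error = corrected - update         # remember what was dropped
    return w - update, new_error
```

A coordinate suppressed in one step accumulates in `error` and is applied in a later step, which is why error feedback preserves convergence under aggressive compression.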

Analyzing the benefits of communication channels between deep learning models [article]

Philippe Lacaille
2019 arXiv   pre-print
The first approach studied looks at decentralizing the numerous computations that are done in parallel in training procedures such as synchronous and asynchronous stochastic gradient descent.  ...  Synchronous and asynchronous SGD: Distributed SGD refers to the widely used approach of data parallelism with large neural network models/datasets.  ... 
arXiv:1904.09211v1 fatcat:qy7mk2g2yfantd2hxjqzqmazy4

On Seven Fundamental Optimization Challenges in Machine Learning [article]

Konstantin Mishchenko
2021 arXiv   pre-print
Unlike Local SGD, FedRR can provably beat gradient descent in communication complexity in the heterogeneous data regime. The fourth challenge is related to the class of adaptive methods.  ...  Our third contribution can be seen as a combination of our new theory for proximal RR and Local SGD yielding a new algorithm, which we call FedRR.  ...  This allows our algorithm FedRR to beat Local SGD after a certain number of iterations, regardless of how heterogeneous the data are. Paper.  ... 
arXiv:2110.12281v1 fatcat:c4oc7xv6fvdqdik4hwegrcnsqm

Stochastic, Distributed and Federated Optimization for Machine Learning [article]

Jakub Konečný
2017 arXiv   pre-print
We propose a communication-efficient framework which iteratively forms local subproblems that can be solved with arbitrary local optimization algorithms.  ...  However, CoCoA+ can still beat other methods in running time.  ...  Hence, we can find the solution with less overall work when using a minibatch of size b than when using a minibatch of size 1.  ... 
arXiv:1707.01155v1 fatcat:t6uqrmnssrafze6l6c7gk5vcyu


Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, Eric P. Xing
2016 Proceedings of the Eleventh European Conference on Computer Systems - EuroSys '16  
This research is supported in part by Intel as part of the Intel Science and Technology Center for Cloud Computing (ISTC-CC), National Science Foundation under awards CNS-1042537 and CNS-1042543 (PRObE  ...  We observe the opposite with data-parallel workers executing on GPUs: while synchronization delays can be largely eliminated, as expected, convergence is much slower with the more asynchronous models because  ...  models, because the negative impact of staleness outweighs the benefits of reduced communication delays.  ... 
doi:10.1145/2901318.2901323 dblp:conf/eurosys/CuiZGGX16 fatcat:gp5fncsmvbgcll2whnz3y45ytm

The Faults in Our Pi Stars: Security Issues and Open Challenges in Deep Reinforcement Learning [article]

Vahid Behzadan, Arslan Munir
2018 arXiv   pre-print
The Asynchronous Advantage Actor-Critic (A3C) algorithm [32] comprises separate actor-learner threads that sample environment steps and update a centralized copy of the parameters asynchronously  ...  Unlike A3C, PPO performs multiple parameter updates using minibatches from each set of samples.  ... 
arXiv:1810.10369v1 fatcat:c2lgl3curvgmthsme5adjq5rza

Machine Learning for Microcontroller-Class Hardware – A Review [article]

Swapnil Sayan Saha, Sandeep Singh Sandha, Mani Srivastava
2022 arXiv   pre-print
ML-MCU [182] — H: Optimized SGD (inherits the stability of GD and the efficiency of SGD); optimized one-versus-one (OVO) binary classifiers; hardware: ARM Cortex-M, Espressif; tasks: image recognition (MNIST), mHealth (Heart  ...  pruning after training with a single minibatch of datapoints (Snip) [150], change in gradient norm due to parameter pruning after training with a single minibatch of datapoints (Grasp) [151], the product  ... 
arXiv:2205.14550v3 fatcat:y272riitirhwfgfiotlwv5i7nu

Deep Dynamic Factor Models [article]

Paolo Andreini, Cosimo Izzo, Giovanni Ricco
2020 arXiv   pre-print
In an empirical application to the forecast and nowcast of economic conditions in the US, we show the potential of this framework in dealing with high dimensional, mixed frequencies and asynchronously  ...  These subsamples are called 'minibatches' and they are equal partitions of the original training dataset. The computational cost of SGD algorithms is independent of the sample size.  ...  For example, the RMSEs at 6 weeks refer to the RMSEs 6 weeks prior to the release date of the variable under consideration.  ... 
arXiv:2007.11887v1 fatcat:fzdnseahg5ekld6hxh3l5r2k34
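The minibatch partitioning described in the snippet, equal partitions of the training set with per-step cost independent of the sample size, can be sketched as a generator. The function name and shapes are illustrative, not from the paper.

```python
import numpy as np

def minibatches(X, y, batch_size, seed=0):
    """Split (X, y) into equal, shuffled minibatch partitions.

    Each SGD step then touches only `batch_size` examples, so the
    per-step cost does not grow with the full sample size.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```

One pass over all minibatches visits every training example exactly once, which is the equal-partition property the authors rely on.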

Automatic Speech Recognition: Systematic Literature Review

Sadeen Alharbi, Muna Alrazgan, Alanoud Alrashed, Turkiah AlNomasi, Raghad Almojel, Rimah Alharbi, Saja Alharbi, Sahar Alturki, Fatimah Alshehri, Maha Almojil
2021 IEEE Access  
The experiments confirmed that the LSTM model could be trained by Asynchronous Decentralized Parallel SGD (ADPSGD) in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER.  ...  In [93], the researchers found that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can run with a much larger batch size than the usually applied synchronous SGD (SSGD) algorithm  ...  This model can be improved in the following aspects: first, model delay: attention-based models can effectively improve the recognition performance, but they are not monotonic and have a long delay.  ... 
doi:10.1109/access.2021.3112535 fatcat:uhyhmyd6b5d2lldkhf6tihnxky