Consistent Bounded-Asynchronous Parameter Servers for Distributed ML
[article]
2013
arXiv
pre-print
The proposed consistency models are implemented in a distributed parameter server and evaluated in the context of a popular ML application: topic modeling. ...
In this paper, we present several relaxed consistency models for asynchronous parallel computation and theoretically prove their algorithmic correctness. ...
Acknowledgments We thank PRObE [4] and CMU PDL Consortium for providing testbed and technical support for our experiments. ...
arXiv:1312.7869v2
fatcat:2af2ztw4mneibbud2wk2u6pdca
Consistent Bounded-Asynchronous Parameter Servers for Distributed ML
2018
The proposed consistency models are implemented in a distributed parameter server and evaluated in the context of a popular ML application: topic modeling. ...
In this paper, we present several relaxed consistency models for asynchronous parallel computation and theoretically prove their algorithmic correctness. ...
Acknowledgments We thank PRObE [4] and CMU PDL Consortium for providing testbed and technical support for our experiments. ...
doi:10.1184/r1/6475529.v1
fatcat:uglj4v74njd7dd4ryewfx7373m
How to scale distributed deep learning?
[article]
2016
arXiv
pre-print
In asynchronous approaches using parameter servers, training is slowed by contention to the parameter server. ...
While a number of approaches have been proposed for distributed stochastic gradient descent (SGD), at the current time synchronous approaches to distributed SGD appear to be showing the greatest performance ...
D'Azevedo, and Chris Fuson for helping make this work possible, as well as Josh Tobin for insightful discussions. We would also like to thank the anonymous reviewers for their constructive feedback. ...
arXiv:1611.04581v1
fatcat:wwbcp6ptbvc75gybuzajptcfhm
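The entry above contrasts asynchronous parameter-server training with synchronous distributed SGD. Below is a minimal sketch of the synchronous variant it refers to, where each step waits for every worker and applies the averaged gradient; the toy least-squares objective and all names are illustrative, not taken from the paper.

```python
# Minimal sketch of synchronous data-parallel SGD: each step waits for all
# workers (a barrier) and applies the averaged gradient. Toy objective and
# names are illustrative only.
import numpy as np

def local_gradient(w, X, y):
    return X.T @ (X @ w - y) / len(y)     # least-squares gradient on one worker's shard

def sync_sgd_step(w, shards, lr=0.1):
    grads = [local_gradient(w, X, y) for X, y in shards]   # gather from every worker
    return w - lr * np.mean(grads, axis=0)                 # apply the averaged gradient

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(16, 4)), rng.normal(size=16)) for _ in range(4)]
w = np.zeros(4)
for _ in range(10):
    w = sync_sgd_step(w, shards)
```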
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
2013
Advances in Neural Information Processing Systems
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML ...
The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale ...
This work is supported in part by Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC) and hardware donations from Intel and NetApp. ...
pmid:25400488
pmcid:PMC4230489
fatcat:7zsk6nl6ibhwfe3ukmsuipy2xe
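The SSP entry above lets workers read older, stale parameter values as long as a staleness bound holds. A minimal sketch of that bound, assuming one clock per worker and a fixed staleness s; the class and method names are illustrative, not the system's API.

```python
# Minimal sketch of the Stale Synchronous Parallel staleness rule: a worker
# at clock c may advance only if the slowest worker has reached at least
# c - s. Illustrative names, not the paper's interface.
class SSPClock:
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers   # per-worker iteration counters
        self.staleness = staleness        # maximum allowed clock gap s

    def can_proceed(self, worker_id):
        return self.clocks[worker_id] - min(self.clocks) <= self.staleness

    def tick(self, worker_id):
        self.clocks[worker_id] += 1

clock = SSPClock(num_workers=4, staleness=3)
clock.tick(0)
print(clock.can_proceed(0))  # True: gap of 1 <= staleness of 3
```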
Managed communication and consistency for fast data-parallel iterative analytics
2015
Proceedings of the Sixth ACM Symposium on Cloud Computing - SoCC '15
While data-parallel ML applications often employ a loose consistency model when updating shared model parameters to maximize parallelism, the accumulated error may seriously impact the quality of refinements ...
At the core of Machine Learning (ML) analytics is often an expert-suggested model, whose parameters are refined by iteratively processing a training dataset until convergence. ...
We thank Mu Li, Jin Kyu Kim, Aaron Harlap, Xun Zheng and Zhiting Hu for their suggestions and help with setting up other third-party systems for comparison. ...
doi:10.1145/2806777.2806778
dblp:conf/cloud/WeiDQHCGGGX15
fatcat:mgqx7iwlare3tciivbyszgk5oq
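The managed-communication idea in the entry above decides which buffered parameter updates are worth sending under a limited bandwidth budget. A hedged sketch of one such policy, ranking accumulated updates by magnitude; the policy and names are illustrative and not necessarily the system's exact scheme.

```python
# Illustrative sketch: under a fixed per-round budget, send the buffered
# updates with the largest magnitude first and keep the rest accumulated
# locally. Not the paper's exact policy.
def select_updates(buffered, budget):
    """buffered: dict param_key -> accumulated delta; budget: max keys to send."""
    ranked = sorted(buffered.items(), key=lambda kv: abs(kv[1]), reverse=True)
    to_send = dict(ranked[:budget])
    remaining = {k: v for k, v in buffered.items() if k not in to_send}
    return to_send, remaining

buf = {"w1": 0.02, "w2": -1.3, "w3": 0.4, "w4": -0.05}
send, keep = select_updates(buf, budget=2)
print(send)   # {'w2': -1.3, 'w3': 0.4}
```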
Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
[article]
2019
arXiv
pre-print
In this survey, we perform a broad and thorough investigation on challenges, techniques and tools for scalable DL on distributed infrastructures. ...
This incorporates infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling and the management of training and model data. ...
The Parallel ML System (PMLS) uses Bösen [187], a bounded-asynchronous parameter server. However, PMLS and Bösen are no longer actively developed. ...
arXiv:1903.11314v2
fatcat:y62z7mteyzeq5kenb7srwtlg7q
Byzantine Fault Tolerance in Distributed Machine Learning : a Survey
[article]
2022
arXiv
pre-print
Byzantine Fault Tolerance (BFT) is among the most challenging problems in Distributed Machine Learning (DML). ...
for scaling up ML [11] . ...
Centralized setting: the centralized setting is the classical one of the distributed machine learning paradigm, consisting of the parameter server model [89] , where there is a central node computing ...
arXiv:2205.02572v1
fatcat:h2hkcgz3w5cvrnro6whl2rpvby
Asynchronous Federated Learning with Differential Privacy for Edge Intelligence
[article]
2019
arXiv
pre-print
Particularly, with consideration of the heterogeneity in practical edge computing systems, asynchronous edge-cloud collaboration based federated learning can further improve the learning efficiency by ...
Despite no raw data sharing, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give some malicious participants great opportunities to infer other parties ...
Compared with distributed ML in the Cloud server, FL relies on a large number of heterogeneous edge devices/servers, which would have heterogeneous training progress and cause severe delays for the collaborative ...
arXiv:1912.07902v1
fatcat:p3fbiogznzfq3cplmv34cc2izy
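The entry above combines asynchronous federated learning with differential privacy. Below is a generic sketch of the standard clip-and-add-Gaussian-noise step applied to a client update before upload; the clip norm and noise scale are placeholders, and this is not the paper's specific mechanism.

```python
# Generic Gaussian-mechanism sketch: clip the client's model update to a
# fixed L2 norm (bounding sensitivity), then add Gaussian noise before
# sending it to the server. Constants are placeholders.
import numpy as np

def privatize_update(delta, clip_norm=1.0, noise_std=0.5,
                     rng=np.random.default_rng(0)):
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=delta.shape)

update = np.array([0.8, -2.0, 0.3])
print(privatize_update(update))
```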
Dynamic Parameter Allocation in Parameter Servers
[article]
2020
arXiv
pre-print
Parameter servers ease the implementation of distributed parameter management, a key concern in distributed training, but can induce severe communication overhead. ...
We found that existing parameter servers provide only limited support for PAL techniques, however, and therefore prevent efficient training. ...
Parameter management is thus a key concern in distributed ML. ...
arXiv:2002.00655v2
fatcat:i537kujvmbhv5paucqa2oks5p4
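The entry above is about where each parameter lives among several server nodes and about relocating parameters during training. A minimal sketch of the static baseline such work improves on, assigning keys to servers by hashing; the names are illustrative and this is not the paper's allocation scheme.

```python
# Minimal sketch of static parameter partitioning: a key is hashed to a
# fixed server node. Dynamic allocation (the paper's topic) would relocate
# keys at runtime instead of fixing this mapping.
import hashlib

def owner(key, num_servers):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

print(owner("embedding[42]", num_servers=4))
```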
Genuinely Distributed Byzantine Machine Learning
[article]
2020
arXiv
pre-print
Machine Learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model. ...
We show that this problem can be solved in an asynchronous system, despite the presence of 1/3 Byzantine parameter servers and 1/3 Byzantine workers (which is optimal). ...
KEYWORDS distributed machine learning, Byzantine fault tolerance, Byzantine parameter servers
The fundamental problem addressed here is induced by the multiplicity of servers and consists of bounding the ...
arXiv:1905.03853v2
fatcat:u6irl56wsregref72p74napnka
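With several replicas of the parameter server, a worker cannot trust any single reply. A common building block (not necessarily this paper's protocol) is to read from all replicas and take a coordinate-wise median, which tolerates a minority of arbitrary values per coordinate; sketched below with illustrative names.

```python
# Illustrative coordinate-wise median over parameter vectors returned by
# several server replicas; a generic Byzantine-robust read, not the paper's
# specific protocol.
import numpy as np

def robust_read(replies):
    return np.median(np.stack(replies), axis=0)

replies = [np.array([1.0, 2.0, 3.0]),
           np.array([1.1, 2.1, 2.9]),
           np.array([9.9, -5.0, 100.0])]   # one faulty replica
print(robust_read(replies))                # stays close to the honest values
```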
Asynchronous Byzantine Machine Learning (the case of SGD)
[article]
2018
arXiv
pre-print
Asynchronous distributed machine learning solutions have proven very effective so far, but always assuming perfectly functioning workers. ...
We introduce Kardam, the first distributed asynchronous stochastic gradient descent (SGD) algorithm that copes with Byzantine workers. ...
Since the communication is assumed to be asynchronous, the parameter server takes into account the first gradient received at time t. ...
arXiv:1802.07928v2
fatcat:lxvmg7tfvzgwroslyocm4agi2u
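The entry above notes that, in the asynchronous setting, the server applies the first gradient it receives at each step. A skeleton of that update loop is sketched below; the 1/(1+staleness) damping is a common heuristic, not Kardam's Byzantine filtering rule, and all names are illustrative.

```python
# Skeleton of asynchronous SGD at the server: apply each gradient as it
# arrives, tagged with the model version it was computed against.
import numpy as np

class AsyncServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.version = 0
        self.lr = lr

    def apply(self, grad, computed_at):
        staleness = self.version - computed_at
        self.w -= self.lr / (1.0 + staleness) * grad   # older gradients count less
        self.version += 1

s = AsyncServer(dim=2)
s.apply(np.array([0.5, -0.2]), computed_at=0)   # fresh gradient
s.apply(np.array([0.1, 0.1]), computed_at=0)    # stale by one version
```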
MALT: Distributed Data-Parallelism for Existing ML Applications
2015
Proceedings of the Tenth European Conference on Computer Systems - EuroSys '15
MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates. ...
Machine learning methods, such as SVM and neural networks, often improve their accuracy by using models with more parameters trained on large numbers of examples. ...
We also thank Igor Durdanovic for helping us port RAPID to MALT and Hans-Peter Graf for his support and encouragement. ...
doi:10.1145/2741948.2741965
dblp:conf/eurosys/LiKKU15
fatcat:vczbxlmkm5gtdlp6gisf5ingby
Distributed Machine Learning via Sufficient Factor Broadcasting
[article]
2015
arXiv
pre-print
To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parameterized models, which share the following ...
When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter ...
SFB does not impose strong requirements on the distributed system -it can be used with synchronous [11, 23, 38] , asynchronous [13, 2, 10] , and bounded-asynchronous consistency models [5, 15, 31] , ...
arXiv:1511.08486v1
fatcat:hum4kp3an5aprmna4uusoxuify
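The entry above targets matrix-parameterized models whose per-sample update decomposes into two much smaller "sufficient factor" vectors, so workers can broadcast those vectors instead of the full update matrix. A sketch of that decomposition, with illustrative shapes.

```python
# Sketch of the sufficient-factor idea: when the per-sample update of a
# J x D parameter matrix is a rank-1 outer product u v^T, communicating the
# two vectors costs J + D numbers instead of J * D. Shapes are illustrative.
import numpy as np

J, D = 100, 500
u = np.random.randn(J)        # e.g. a prediction-error vector
v = np.random.randn(D)        # e.g. the input feature vector
delta_W = np.outer(u, v)      # full update matrix: J * D numbers
print(delta_W.shape, u.size + v.size)   # (100, 500) vs. 600 numbers sent
```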
Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning
[article]
2019
arXiv
pre-print
A popular solution is to distribute and parallelize the training process across multiple machines using the parameter server framework. ...
In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP) which improves the state-of-the-art Stale Synchronous Parallel (SSP) ...
In a nutshell, the parameter server framework consists of a logic server and many workers. Workers are all connected to the server. ...
arXiv:1908.11848v1
fatcat:ta3pop7phjcb5esipaxf754pfq
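The entry above describes the parameter server framework itself: one logical server holds the model, and workers pull parameters, compute locally, and push updates back. A minimal sketch of that pull/push interface, assuming illustrative names rather than any particular framework's API.

```python
# Minimal sketch of the pull/push interface of a parameter server: workers
# pull the current parameters, compute on their data shard, and push deltas
# back for the server to fold in. Illustrative names only.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def pull(self):
        return self.w.copy()          # a worker reads the current parameters

    def push(self, delta):
        self.w += delta               # the server applies a worker's update

ps = ParameterServer(dim=3)
w_local = ps.pull()
ps.push(np.array([0.1, -0.2, 0.0]))   # e.g. -lr * gradient from one worker
```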
ASAP: Asynchronous Approximate Data-Parallel Computation
[article]
2016
arXiv
pre-print
In this paper, we present ASAP, a model that provides asynchronous and approximate processing semantics for data-parallel computation. ...
In our results, we show that ASAP can reduce synchronization costs and provides 2-10X speedups in convergence and up to 10X savings in network costs for distributed machine learning applications and provides ...
Acknowledgments We would like to thank Cun Mu for his help with the analysis of stochastic reduce convergence, and Igor Durdanovic for helping us port RAPID to MALT-2. ...
arXiv:1612.08608v1
fatcat:sy7t3mr6lrddjafm7anwp2j7vu