2,081 hits

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML [article]

Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho, Eric P. Xing
2013 arXiv   pre-print
The proposed consistency models are implemented in a distributed parameter server and evaluated in the context of a popular ML application: topic modeling.  ...  In this paper, we present several relaxed consistency models for asynchronous parallel computation and theoretically prove their algorithmic correctness.  ...  Acknowledgments We thank PRObE [4] and CMU PDL Consortium for providing testbed and technical support for our experiments.  ... 
arXiv:1312.7869v2 fatcat:2af2ztw4mneibbud2wk2u6pdca
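The bounded-asynchronous consistency idea in this result (and several below) can be pictured with a minimal single-process sketch: reads may be served from a local cache as long as the cache is at most a fixed number of clock ticks behind the writer. The `StaleStore` class and its `staleness` parameter are hypothetical illustrations, not the interface of the parameter server evaluated in the paper.

```python
# Minimal single-process sketch of bounded-staleness reads: values are
# served from a local cache that is refreshed only when it falls more
# than `staleness` clock ticks behind. StaleStore and its methods are
# hypothetical, not the interface of the system in the paper.

class StaleStore:
    def __init__(self, staleness):
        self.staleness = staleness    # maximum tolerated clock gap
        self.clock = 0                # global clock
        self.fresh = {}               # authoritative values
        self.cache = {}               # possibly stale local copy
        self.cache_clock = 0          # clock at last cache refresh

    def write(self, key, value):
        self.fresh[key] = value

    def tick(self):
        self.clock += 1

    def read(self, key, default=0.0):
        if self.clock - self.cache_clock > self.staleness:
            self.cache = dict(self.fresh)     # forced refresh
            self.cache_clock = self.clock
        return self.cache.get(key, default)

store = StaleStore(staleness=2)
store.write("w", 1.0)
for _ in range(5):
    store.tick()
    print(store.clock, store.read("w"))   # stale 0.0 until the bound forces a refresh
```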

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho, Eric P Xing
2018
The proposed consistency models are implemented in a distributed parameter server and evaluated in the context of a popular ML application: topic modeling.  ...  In this paper, we present several relaxed consistency models for asynchronous parallel computation and theoretically prove their algorithmic correctness.  ...  Acknowledgments We thank PRObE [4] and CMU PDL Consortium for providing testbed and technical support for our experiments.  ... 
doi:10.1184/r1/6475529.v1 fatcat:uglj4v74njd7dd4ryewfx7373m

How to scale distributed deep learning? [article]

Peter H. Jin, Qiaochu Yuan, Forrest Iandola, Kurt Keutzer
2016 arXiv   pre-print
In asynchronous approaches using parameter servers, training is slowed by contention to the parameter server.  ...  While a number of approaches have been proposed for distributed stochastic gradient descent (SGD), at the current time synchronous approaches to distributed SGD appear to be showing the greatest performance  ...  D'Azevedo, and Chris Fuson for helping make this work possible, as well as Josh Tobin for insightful discussions. We would also like to thank the anonymous reviewers for their constructive feedback.  ... 
arXiv:1611.04581v1 fatcat:wwbcp6ptbvc75gybuzajptcfhm
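The snippet contrasts synchronous SGD (barrier and average every step) with asynchronous SGD, where workers race to update a shared parameter server. The toy comparison below, assuming a trivial 1-D objective and Python threads standing in for workers, only makes that contrast concrete; it is not the distributed setup benchmarked in the paper.

```python
# Toy contrast between synchronous and asynchronous SGD on a 1-D problem
# (minimize (w - 3)^2). Python threads stand in for workers; the lock
# stands in for a single parameter server that every asynchronous update
# contends on. Illustrative only.
import threading

def grad(w):
    return 2.0 * (w - 3.0)

def sync_sgd(workers=4, steps=50, lr=0.05):
    w = 0.0
    for _ in range(steps):
        g = sum(grad(w) for _ in range(workers)) / workers  # barrier + average
        w -= lr * g
    return w

def async_sgd(workers=4, steps=50, lr=0.05):
    state = {"w": 0.0}
    lock = threading.Lock()              # the "parameter server"

    def worker():
        for _ in range(steps):
            g = grad(state["w"])         # read a possibly stale value
            with lock:                   # contended update
                state["w"] -= lr * g

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["w"]

print("sync :", sync_sgd())
print("async:", async_sgd())
```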

More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B Gibbons, Garth A Gibson, Gregory R Ganger, Eric P Xing
2013 Advances in Neural Information Processing Systems  
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML  ...  The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale  ...  This work is supported in part by Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC) and hardware donations from Intel and NetApp.  ... 
pmid:25400488 pmcid:PMC4230489 fatcat:7zsk6nl6ibhwfe3ukmsuipy2xe
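The SSP model described here bounds how far the fastest worker may run ahead of the slowest. A minimal sketch of that clock bookkeeping, using a hypothetical `SSPClock` class (the real server also manages parameter caches and communication), might look as follows.

```python
# Minimal sketch of Stale Synchronous Parallel (SSP) clock bookkeeping:
# a worker may advance to clock c+1 only if the slowest worker has
# reached at least c+1 - staleness. Hypothetical SSPClock class; the
# actual server also manages parameter caches and network traffic.

class SSPClock:
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def can_advance(self, worker):
        return self.clocks[worker] + 1 - self.staleness <= min(self.clocks)

    def advance(self, worker):
        if not self.can_advance(worker):
            return False          # a real worker would block here
        self.clocks[worker] += 1
        return True

ssp = SSPClock(num_workers=3, staleness=2)
for step in range(5):
    moved = [ssp.advance(w) for w in (0, 1)]   # worker 2 is a straggler
    print(step, moved, ssp.clocks)             # fast workers stall once s clocks ahead
```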

Managed communication and consistency for fast data-parallel iterative analytics

Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing
2015 Proceedings of the Sixth ACM Symposium on Cloud Computing - SoCC '15  
While data-parallel ML applications often employ a loose consistency model when updating shared model parameters to maximize parallelism, the accumulated error may seriously impact the quality of refinements  ...  At the core of Machine Learning (ML) analytics is often an expert-suggested model, whose parameters are refined by iteratively processing a training dataset until convergence.  ...  We thank Mu Li, Jin Kyu Kim, Aaron Harlap, Xun Zheng and Zhiting Hu for their suggestions and help with setting up other third-party systems for comparison.  ... 
doi:10.1145/2806777.2806778 dblp:conf/cloud/WeiDQHCGGGX15 fatcat:mgqx7iwlare3tciivbyszgk5oq
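One mechanism behind the managed communication in this result is spending a limited communication budget on the most significant parameter changes. The sketch below, with hypothetical names and a simple top-k-by-magnitude rule, illustrates the general idea rather than the paper's actual scheduler.

```python
# Illustrative sketch of magnitude-prioritized update communication:
# under a per-round budget, send only the largest accumulated deltas
# and keep the rest buffered locally. Hypothetical names; not the
# paper's implementation.

def communicate(pending, budget):
    """pending: dict key -> accumulated delta; budget: max keys to send."""
    chosen = sorted(pending, key=lambda k: abs(pending[k]), reverse=True)[:budget]
    sent = {k: pending[k] for k in chosen}
    for k in chosen:
        pending[k] = 0.0          # delta flushed to the server
    return sent

pending = {"w0": 0.01, "w1": -2.5, "w2": 0.3, "w3": 0.0007}
print(communicate(pending, budget=2))   # {'w1': -2.5, 'w2': 0.3}
print(pending)                          # small deltas stay buffered for later rounds
```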

Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools [article]

Ruben Mayer, Hans-Arno Jacobsen
2019 arXiv   pre-print
In this survey, we perform a broad and thorough investigation on challenges, techniques and tools for scalable DL on distributed infrastructures.  ...  This incorporates infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling and the management of training and model data.  ...  The Parallel ML System (PMLS) uses Bösen [187], a bounded-asynchronous parameter server. However, PMLS and Bösen are no longer actively developed.  ... 
arXiv:1903.11314v2 fatcat:y62z7mteyzeq5kenb7srwtlg7q

Byzantine Fault Tolerance in Distributed Machine Learning : a Survey [article]

Djamila Bouhata, Hamouma Moumen
2022 arXiv   pre-print
Byzantine Fault Tolerance (BFT) is among the most challenging problems in Distributed Machine Learning (DML).  ...  for scaling up ML [11].  ...  Centralized setting: the centralized setting is the classical one of the distributed machine learning paradigm, consisting of the parameter server model [89], where there is a central node computing  ... 
arXiv:2205.02572v1 fatcat:h2hkcgz3w5cvrnro6whl2rpvby
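For the centralized (parameter-server) setting this survey focuses on, Byzantine resilience usually comes from a robust aggregation rule at the server. As one illustrative example among the many rules surveyed, a coordinate-wise trimmed mean discards the extreme values in every coordinate before averaging:

```python
# Sketch of one standard Byzantine-robust aggregation rule for the
# centralized (parameter-server) setting: a coordinate-wise trimmed mean
# that drops the b largest and b smallest values per coordinate before
# averaging. Illustrative example only; the survey covers many rules.

def trimmed_mean(gradients, b):
    """gradients: list of equal-length worker gradients; b: values trimmed per side."""
    agg = []
    for coord in zip(*gradients):
        kept = sorted(coord)[b:len(coord) - b]
        agg.append(sum(kept) / len(kept))
    return agg

honest = [[0.9, -1.1], [1.0, -1.0], [1.1, -0.9], [1.0, -1.0]]
byzantine = [[1e6, 1e6]]                 # arbitrary/poisoned gradient
print(trimmed_mean(honest + byzantine, b=1))   # stays near [1.0, -1.0]
```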

Asynchronous Federated Learning with Differential Privacy for Edge Intelligence [article]

Yanan Li, Shusen Yang, Xuebin Ren, Cong Zhao
2019 arXiv   pre-print
Particularly, with consideration of the heterogeneity in practical edge computing systems, asynchronous edge-cloud collaboration based federated learning can further improve the learning efficiency by  ...  Despite no raw data sharing, the open architecture and extensive collaborations of asynchronous federated learning (AFL) still give some malicious participants great opportunities to infer other parties  ...  Compared with distributed ML in the Cloud server, FL relies on a large number of heterogeneous edge devices/servers, which would have heterogeneous training progress and cause severe delays for the collaborative  ... 
arXiv:1912.07902v1 fatcat:p3fbiogznzfq3cplmv34cc2izy
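A minimal sketch of the local differential-privacy step such schemes apply before an update leaves the device: clip the update to an L2 bound, then add Gaussian noise. The `clip_norm` and `sigma` values below are arbitrary illustrations, not the calibration analyzed in the paper.

```python
# Minimal sketch of local differential-privacy protection for a model
# update before it is sent to the aggregation server: clip to an L2
# bound, then add Gaussian noise. clip_norm and sigma are illustrative,
# not the paper's privacy calibration.
import math
import random

def privatize(update, clip_norm=1.0, sigma=0.5):
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in update]
    return [x + random.gauss(0.0, sigma * clip_norm) for x in clipped]

print(privatize([3.0, 4.0]))   # clipped to norm 1, then noised
```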

Dynamic Parameter Allocation in Parameter Servers [article]

Alexander Renz-Wieland, Rainer Gemulla, Steffen Zeuch, Volker Markl
2020 arXiv   pre-print
Parameter servers ease the implementation of distributed parameter management, a key concern in distributed training, but can induce severe communication overhead.  ...  We found, however, that existing parameter servers provide only limited support for PAL techniques and therefore prevent efficient training.  ...  Parameter management is thus a key concern in distributed ML.  ... 
arXiv:2002.00655v2 fatcat:i537kujvmbhv5paucqa2oks5p4
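The dynamic-allocation idea is that ownership of a parameter can move between server shards during training so that parameters live near the workers that access them. The toy `DynamicAllocator` below (hypothetical names, no networking) sketches just the location-table bookkeeping.

```python
# Toy sketch of dynamic parameter allocation: a location table maps each
# parameter key to the server shard currently owning it, and ownership
# can be relocated toward the node that accesses the key most often.
# Hypothetical class and method names, for illustration only.

class DynamicAllocator:
    def __init__(self, num_servers):
        self.num_servers = num_servers
        self.owner = {}    # key -> server id (overrides static placement)

    def lookup(self, key):
        # Default placement by hash; overridden once a key is relocated.
        return self.owner.get(key, hash(key) % self.num_servers)

    def relocate(self, key, server):
        self.owner[key] = server   # move ownership near the heavy accessor

alloc = DynamicAllocator(num_servers=4)
print("before:", alloc.lookup("embedding[42]"))
alloc.relocate("embedding[42]", server=2)
print("after :", alloc.lookup("embedding[42]"))
```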

Genuinely Distributed Byzantine Machine Learning [article]

El-Mahdi El-Mhamdi and Rachid Guerraoui and Arsany Guirguis and Lê Nguyên Hoang and Sébastien Rouault
2020 arXiv   pre-print
Machine Learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model.  ...  We show that this problem can be solved in an asynchronous system, despite the presence of 1/3 Byzantine parameter servers and 1/3 Byzantine workers (which is optimal).  ...  KEYWORDS distributed machine learning, Byzantine fault tolerance, Byzantine parameter servers e fundamental problem addressed here is induced by the multiplicity of servers and consists of bounding the  ... 
arXiv:1905.03853v2 fatcat:u6irl56wsregref72p74napnka
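Because up to a third of the parameter servers themselves may be Byzantine, a worker cannot trust any single server reply. One crude way to picture this, which simplifies away most of the paper's actual protocol, is to pull the model from every server replica and take a coordinate-wise median:

```python
# Crude worker-side sketch for a setting with multiple parameter servers,
# up to f of which may be Byzantine: pull the model from every server and
# use a coordinate-wise median of the replies as the working copy.
# Illustration of the idea only, not the paper's protocol.
import statistics

def robust_pull(replies, f):
    """replies: list of model vectors (lists), one per server replica."""
    assert len(replies) >= 3 * f + 1, "need n >= 3f + 1 server replicas"
    return [statistics.median(coord) for coord in zip(*replies)]

honest = [[0.50, 1.01], [0.49, 0.99], [0.51, 1.00]]
byzantine = [[9e9, -9e9]]                 # arbitrary reply from a faulty server
print(robust_pull(honest + byzantine, f=1))   # close to [0.5, 1.0]
```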

Asynchronous Byzantine Machine Learning (the case of SGD) [article]

Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Rhicheek Patra, Mahsa Taziki
2018 arXiv   pre-print
Asynchronous distributed machine learning solutions have proven very effective so far, but always assuming perfectly functioning workers.  ...  We introduce Kardam, the first distributed asynchronous stochastic gradient descent (SGD) algorithm that copes with Byzantine workers.  ...  Since the communication is assumed to be asynchronous, the parameter server takes into account the first gradient received at time t.  ... 
arXiv:1802.07928v2 fatcat:lxvmg7tfvzgwroslyocm4agi2u
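Kardam works by filtering out suspicious gradients before they are applied. The sketch below shows the general shape of such a server-side filter using a simple norm test against recently accepted gradients; the specific test is an illustrative stand-in, not Kardam's Lipschitz-based criterion.

```python
# Shape of a server-side gradient filter for asynchronous Byzantine SGD:
# accept a gradient only if it is not wildly larger than recently
# accepted ones. The test (norm within a factor of the recent median
# norm) is an illustrative stand-in, NOT Kardam's Lipschitz filter.
import collections
import math
import statistics

class GradientFilter:
    def __init__(self, window=10, factor=3.0):
        self.norms = collections.deque(maxlen=window)
        self.factor = factor

    def accept(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        if len(self.norms) >= 3 and norm > self.factor * statistics.median(self.norms):
            return False              # looks Byzantine: drop it
        self.norms.append(norm)
        return True

f = GradientFilter()
print([f.accept(g) for g in ([0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [50.0, 50.0])])
# -> [True, True, True, False]
```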

MALT

Hao Li, Asim Kadav, Erik Kruus, Cristian Ungureanu
2015 Proceedings of the Tenth European Conference on Computer Systems - EuroSys '15  
MALT provides abstractions for fine-grained in-memory updates using one-sided RDMA, limiting data movement costs during incremental model updates.  ...  Machine learning methods, such as SVM and neural networks, often improve their accuracy by using models with more parameters trained on large numbers of examples.  ...  We also thank Igor Durdanovic for helping us port RAPID to MALT and Hans-Peter Graf for his support and encouragement.  ... 
doi:10.1145/2741948.2741965 dblp:conf/eurosys/LiKKU15 fatcat:vczbxlmkm5gtdlp6gisf5ingby
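MALT's one-sided updates can be pictured as each worker exposing per-sender receive buffers that peers write into without the receiver's participation. The plain-Python sketch below uses lists as a stand-in for RDMA-registered memory; the class and method names are illustrative, not MALT's API.

```python
# Toy sketch of a one-sided update abstraction: each worker exposes a
# per-sender receive queue, and a peer "writes" its gradient directly
# into that queue without the receiver taking part in the transfer.
# Plain Python lists stand in for RDMA-registered memory; names are
# illustrative, not MALT's actual interface.

class Worker:
    def __init__(self, wid, num_workers):
        self.wid = wid
        self.inbox = {p: [] for p in range(num_workers) if p != wid}

    def push_to(self, peer, grad):
        peer.inbox[self.wid].append(grad)      # "one-sided" write into the peer

    def gather(self):
        grads = [g for q in self.inbox.values() for g in q]
        for q in self.inbox.values():
            q.clear()
        return grads

w0, w1 = Worker(0, 2), Worker(1, 2)
w0.push_to(w1, [0.1, -0.2])
print(w1.gather())
```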

Distributed Machine Learning via Sufficient Factor Broadcasting [article]

Pengtao Xie, Jin Kyu Kim, Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang Yu, Eric Xing
2015 arXiv   pre-print
To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parameterized models, which share the following  ...  When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter  ...  SFB does not impose strong requirements on the distributed system: it can be used with synchronous [11, 23, 38], asynchronous [13, 2, 10], and bounded-asynchronous consistency models [5, 15, 31],  ... 
arXiv:1511.08486v1 fatcat:hum4kp3an5aprmna4uusoxuify
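For models whose stochastic gradient over the parameter matrix is a rank-1 outer product, the update u v^T can be shipped as the pair (u, v) instead of the full matrix, sending O(m + n) numbers rather than O(m * n). A small worked sketch of reconstructing such an update on the receiving side:

```python
# Worked sketch of the sufficient-factor idea: when the stochastic
# gradient of a parameter matrix W is a rank-1 outer product u v^T,
# workers can exchange the vectors (u, v) and rebuild the full update
# locally. Plain-Python illustration, not the SFB system itself.

def apply_sufficient_factor(W, u, v, lr):
    """In-place update W -= lr * outer(u, v). W is a list of rows."""
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            W[i][j] -= lr * ui * vj

m, n = 3, 4
W = [[0.0] * n for _ in range(m)]
u = [1.0, 2.0, 3.0]          # e.g. per-class prediction error
v = [0.5, 0.0, -0.5, 1.0]    # e.g. the input feature vector
apply_sufficient_factor(W, u, v, lr=0.1)
print(W[0])   # first row of the reconstructed update
```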

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning [article]

Xing Zhao and Aijun An and Junfeng Liu and Bao Xin Chen
2019 arXiv   pre-print
A popular solution is to distribute and parallelize the training process across multiple machines using the parameter server framework.  ...  In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP) which improves the state-of-the-art Stale Synchronous Parallel (SSP)  ...  In a nutshell, the parameter server framework consists of a logic server and many workers. Workers are all connected to the server.  ... 
arXiv:1908.11848v1 fatcat:ta3pop7phjcb5esipaxf754pfq
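DSSP's distinguishing feature is choosing the staleness threshold at run time rather than fixing it. The rule below, widening the bound when worker clocks drift apart and tightening it when they are in step, is an assumed illustration of such adaptation, not the decision procedure proposed in the paper.

```python
# Illustrative sketch of dynamically adjusting an SSP staleness bound
# between fixed limits based on how spread out the workers' clocks are.
# The adaptation rule is an assumption for illustration only, not the
# paper's DSSP decision procedure.

def adjust_staleness(clocks, s, s_min=1, s_max=8):
    spread = max(clocks) - min(clocks)
    if spread >= s:            # workers drifting apart: allow more slack
        return min(s + 1, s_max)
    if spread <= s // 2:       # workers in step: tighten the bound
        return max(s - 1, s_min)
    return s

s = 3
for clocks in ([5, 5, 4], [9, 6, 5], [12, 7, 6]):
    s = adjust_staleness(clocks, s)
    print(clocks, "-> staleness", s)
```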

ASAP: Asynchronous Approximate Data-Parallel Computation [article]

Asim Kadav, Erik Kruus
2016 arXiv   pre-print
In this paper, we present ASAP, a model that provides asynchronous and approximate processing semantics for data-parallel computation.  ...  In our results, we show that ASAP can reduce synchronization costs and provides 2-10X speedups in convergence and up to 10X savings in network costs for distributed machine learning applications and provides  ...  Acknowledgments We would like to thank Cun Mu for his help with the analysis of stochastic reduce convergence, and Igor Durdanovic for helping us port RAPID to MALT-2.  ... 
arXiv:1612.08608v1 fatcat:sy7t3mr6lrddjafm7anwp2j7vu
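ASAP's approximate semantics can be pictured as reducing over whichever worker updates have already arrived instead of blocking for all of them. The `partial_reduce` function below is an illustrative stand-in for that behavior, not ASAP's API.

```python
# Sketch of an approximate (partial) reduce: average whatever subset of
# worker updates has already arrived instead of blocking for all of
# them. Illustrative of asynchronous/approximate semantics only.

def partial_reduce(arrived, dim):
    """arrived: dict worker_id -> update vector (some workers may be missing)."""
    if not arrived:
        return [0.0] * dim
    total = [0.0] * dim
    for vec in arrived.values():
        for i, x in enumerate(vec):
            total[i] += x
    return [x / len(arrived) for x in total]

# Worker 2's update has not arrived yet; reduce over the other two.
print(partial_reduce({0: [1.0, 2.0], 1: [3.0, 4.0]}, dim=2))   # [2.0, 3.0]
```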
Showing results 1 — 15 out of 2,081 results