A Survey of Coded Distributed Computing [article]

Jer Shyuan Ng, Wei Yang Bryan Lim, Nguyen Cong Luong, Zehui Xiong, Alia Asheralieva, Dusit Niyato, Cyril Leung, Chunyan Miao
2020 arXiv   pre-print
Distributed computing has become a common approach for large-scale computation of tasks due to benefits such as high reliability, scalability, computation speed, and cost-effectiveness.  ...  Then, we review and analyze a number of CDC approaches proposed to reduce the communication costs, mitigate the straggler effects, and guarantee privacy and security.  ...  The cluster of computers is modelled as a master-worker system which consists of a single master node and multiple workers to store and analyze massive amounts of unstructured data.  ... 
arXiv:2008.09048v1 fatcat:riy4dxvuc5ae3krz7lf25zkg6m

Train Where the Data is: A Case for Bandwidth Efficient Coded Training [article]

Zhifeng Lin, Krishna Giri Narra, Mingchao Yu, Salman Avestimehr, Murali Annavaram
2019 arXiv   pre-print
Furthermore, coded computing traditionally relied on a central node to encode and distribute data to all the worker nodes, which is not practical in a distributed mobile setting.  ...  But there is a growing interest in enabling training near the data. For instance, mobile devices are rich sources of training data.  ...  Straggler Mitigation: Straggler mitigation in distributed computing has received considerable attention and many techniques have been proposed in the literature.  ... 
arXiv:1910.10283v1 fatcat:5u5evimcbjemrancv2bl6uhiby

Serverless Straggler Mitigation using Local Error-Correcting Codes [article]

Vipul Gupta, Dominic Carrano, Yaoqing Yang, Vaishaal Shankar, Thomas Courtade, Kannan Ramchandran
2020 arXiv   pre-print
We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication and evaluate them on several common applications from machine learning  ...  On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability  ...  In Fig. 4, for example, only two blocks need to be read to mitigate a straggler.  ... 
arXiv:2001.07490v1 fatcat:ptbzh4ld3jezphosqkylgpadni

Coded Computation over Heterogeneous Clusters [article]

Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, Amir Salman Avestimehr
2019 arXiv   pre-print
We propose a coding framework for speeding up distributed computing in heterogeneous clusters by trading redundancy for reducing the latency of computation.  ...  There have been recent results that demonstrate the impact of coding for efficient utilization of computation and storage redundancy to alleviate the effect of stragglers and communication bottlenecks  ...  The work in [16] proposes coding schemes for mitigating stragglers in distributed batch gradient computation.  ... 
arXiv:1701.05973v5 fatcat:wan745p6pbdbldnksc4ifn7bba

Efficient Replication for Straggler Mitigation in Distributed Computing [article]

Amir Behrouzi-Far, Emina Soljanin
2020 arXiv   pre-print
Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers.  ...  Finally, by running experiments on Google cluster traces, we observe that redundancy can reduce the compute time of the jobs in Google clusters by an order of magnitude, and that the optimum level of redundancy  ...  ACKNOWLEDGEMENT This research was supported in part by the NSF awards No. CIF-1717314 and CCF-1559855.  ... 
arXiv:2006.02318v2 fatcat:5vmx235oangghmu6uvoj4h7jpi

Combating Computational Heterogeneity in Large-Scale Distributed Computing via Work Exchange [article]

Mohamed A. Attia, Ravi Tandon
2017 arXiv   pre-print
levels is not available.  ...  We then present our approach of work exchange to combat the latency problem, in which faster workers can be reassigned additional leftover computations that were originally assigned to slower workers.  ...  At the core of the straggler problem is the heterogeneity of computation across the workers, i.e., different workers in the cluster may have different computational capabilities.  ... 
arXiv:1711.08452v1 fatcat:nezkg4w6tbdb7ajw6we5fe7bqa

Robust Gradient Descent via Moment Encoding with LDPC Codes [article]

Raj Kumar Maity, Ankit Singh Rawat, Arya Mazumdar
2019 arXiv   pre-print
To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation.  ...  The iterative decoding algorithms for LDPC codes have very low computational overhead and the number of decoding iterations can be made to automatically adjust with the number of stragglers in the system  ...  Acknowledgements This work is supported in part by National Science Foundation awards CCF 1642658 (CAREER) and CCF 1618512.  ... 
arXiv:1805.08327v2 fatcat:bghtp26hhjbutjnxx6jffb3e5q

Addressing the straggler problem for iterative convergent parallel ML

Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing
2016 Proceedings of the Seventh ACM Symposium on Cloud Computing - SoCC '16  
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML).  ...  ., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others.  ...  This research is supported in part by Intel as part of the Intel Science and Technology Center for Cloud Computing (ISTC-CC), National Science Foundation under awards CNS-1042537, CCF-1533858, CNS-1042543  ... 
doi:10.1145/2987550.2987554 dblp:conf/cloud/HarlapCDWGGGX16 fatcat:ajh5kcppyrhxpnoqrmybkkly2i

Latency Analysis of Coded Computation Schemes over Wireless Networks [article]

Amirhossein Reisizadeh, Ramtin Pedarsani
2017 arXiv   pre-print
In particular, optimal coding schemes for minimizing latency in distributed computation of linear functions and mitigating the effect of stragglers were proposed for a wired network, where the workers can  ...  In this paper, we focus on the problem of coded computation over a wireless master-worker setup with straggling workers, where only one worker can transmit the result of its local computation back to the  ...  The traditional approach for mitigating these bottlenecks is to introduce computation redundancy in the form of task replicas.  ... 
arXiv:1707.00040v1 fatcat:ibkycb2rtrfhziwyv66tothr6a

Mitigate data skew caused stragglers through ImKP partition in MapReduce

Xue Ouyang, Huan Zhou, Stephen Clement, Paul Townend, Jie Xu
2017 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)  
Speculative execution is the mechanism adopted by current MapReduce framework when dealing with the straggler problem, and it functions through creating redundant copies for identified stragglers.  ...  In this paper, we focus on mitigating data skew caused Reduce stragglers, propose ImKP, an Intermediate Key Pre-processing framework that enables the even distributed partition for Reduce inputs.  ...  For Reduce skew handling approaches, Co-worker [10] functions in a way that as long as a straggler is identified, the reserved co-worker task will help process the remaining data.  ... 
doi:10.1109/pccc.2017.8280475 dblp:conf/ipccc/OuyangZCTX17 fatcat:ory4bvmpofdpfnzxnjcjumut2i


Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, Randy Katz
2014 Proceedings of the ACM Symposium on Cloud Computing - SOCC '14  
Existing modeling-based approaches are hard to rely on for production-level adoption due to modeling errors. We present Wrangler, a system that proactively avoids situations that cause stragglers.  ...  For production-level workloads from Facebook and Cloudera's customers, Wrangler improves the 99th percentile job completion time by up to 61% as compared to speculative execution, a widely used straggler  ...  We also thank our shepherd, Fred Douglis, for help in shaping the final version of the paper.  ... 
doi:10.1145/2670979.2671005 dblp:conf/cloud/YadwadkarAK14 fatcat:rfwiymfinrdj3iq4446ogppvee

Speculative pipelining for compute cloud programming

H. T. Kung, Chit-Kwan Lin, Dario Vlah, Giovanni Berlanda Scorza
compute clouds.  ...  These phases can experience unpredictable delays when available computing and network capacities fluctuate or when there are large disparities in inter-node communication delays, as can occur on shared  ...  Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.  ... 
doi:10.1109/milcom.2010.5680451 fatcat:7ulyy646ljacjm4sapbtxbhqtu

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding [article]

Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
2019 arXiv   pre-print
Gradient coded distributed GD uses redundancy to exactly recover the gradient at each iteration from a subset of compute nodes.  ...  We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding.  ...  and AWS Cloud Credits for Research from Amazon.  ... 
arXiv:1901.09671v1 fatcat:lkhtdq5lhjb3phyxswv5jblbem

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning [article]

Can Karakus, Yifan Sun, Suhas Diggavi, Wotao Yin
2018 arXiv   pre-print
We propose a distributed optimization framework where the dataset is "encoded" to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left  ...  Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation.  ...  Acknowledgments The work of Can Karakus and Suhas Diggavi was supported in part by NSF grants #1314937 and #1514531.  ... 
arXiv:1803.05397v1 fatcat:s7773b2nunbsnf6bpsyzazbuty

Adaptive Verifiable Coded Computing: Towards Fast, Secure and Private Distributed Machine Learning [article]

Tingting Tang, Ramy E. Ali, Hanieh Hashemi, Tynan Gangwani, Salman Avestimehr, Murali Annavaram
2022 arXiv   pre-print
AVCC leverages coded computing just for handling stragglers and privacy, and then uses an orthogonal approach that leverages verifiable computing to mitigate Byzantine workers.  ...  Stragglers, Byzantine workers, and data privacy are the main bottlenecks in distributed cloud computing. Some prior works proposed coded computing strategies to jointly address all three challenges.  ...  The workers must remain oblivious to the mitigating straggler effects and for tackling Byzantine nodes.  ... 
arXiv:2107.12958v2 fatcat:f4zr6cymjray3mwdcin2dsyqoi