
Straggler Mitigation by Delayed Relaunch of Tasks [article]

Mehmet Fatih Aktas, Pei Peng, Emina Soljanin
2017 arXiv   pre-print
We show that delaying redundancy is not effective in reducing cost and that delayed relaunch of stragglers can yield significant reduction in cost and latency.  ...  We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks.  ...  Untimely relaunch may cause reduction in gain or even pain by either late relaunch and delayed cancellation of stragglers or early relaunch and killing non-stragglers.  ... 
arXiv:1710.00414v1 fatcat:ilwh4i5s5vgtbofvt7fr3t7x7u

Straggler Mitigation at Scale [article]

Mehmet Fatih Aktas, Emina Soljanin
2019 arXiv   pre-print
This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure coded redundancy.  ...  3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together?  ...  REDUNDANCY TOGETHER WITH RELAUNCH In this section, we consider employing redundant tasks and straggler relaunch jointly for straggler mitigation. A.  ... 
arXiv:1906.10664v2 fatcat:t62geuwic5bcjjhiffc5dft4mi

NURD: Negative-Unlabeled Learning for Online Datacenter Straggler Prediction [article]

Yi Ding, Avinash Rao, Hyebin Song, Rebecca Willett, Henry Hoffmann
2022 arXiv   pre-print
Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job.  ...  Within a running job, however, none of this information is available until stragglers have revealed themselves when they have already delayed the job.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.  ... 
arXiv:2203.08339v1 fatcat:2wpyuyd2grd5bd6qwtgy7chrde

Wrangler: Predictable and Faster Jobs using Fewer Resources

Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, Randy Katz
2014 Proceedings of the ACM Symposium on Cloud Computing - SOCC '14  
In particular, by using these predictions to balance delay in task scheduling against the potential for idling of resources, Wrangler achieves a speed up in the overall job completion time.  ...  Existing straggler mitigation techniques are inefficient due to their reactive and replicative nature; they rely on a wait-speculate-reexecute mechanism, thus leading to delayed straggler detection and  ...  This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and  ... 
doi:10.1145/2670979.2671005 dblp:conf/cloud/YadwadkarAK14 fatcat:rfwiymfinrdj3iq4446ogppvee

Slack Squeeze Coded Computing for Adaptive Straggler Mitigation [article]

Krishna Giri Narra, Zhifeng Lin, Mehrdad Kiamari, Salman Avestimehr, Murali Annavaram
2019 arXiv   pre-print
Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations.  ...  We implement an LSTM-based speed prediction algorithm to predict speeds of compute nodes.  ...  This material is based upon work supported by Defense Advanced Research Projects Agency (DARPA) under Contract No.  ... 
arXiv:1904.07098v2 fatcat:6mj2ox3a5jezlpl2zt2vweehfi

Straggler-Aware Distributed Learning: Communication–Computation Latency Trade-Off

Emre Ozfatura, Sennur Ulukus, Deniz Gündüz
2020 Entropy  
stragglers.  ...  Imposing such a limitation results in two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding partial computations carried out by  ...  Author Contributions: Conceptualization Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/e22050544 pmid:33286316 pmcid:PMC7517046 fatcat:uw4mzulavvhbrh7cctqlitke5u

Optimization for Speculative Execution of Multiple Jobs in a MapReduce-like Cluster [article]

Huanle Xu, Wing Cheong Lau
2015 arXiv   pre-print
A parallel processing job can be delayed substantially as long as one of its many tasks is being assigned to a failing machine.  ...  We show that the ESE algorithm can beat the Mantri baseline scheme by 18% in terms of job flowtime while consuming the same amount of resource.  ...  Recently, [2] proposes to mitigate the straggler problem by cloning every small job and avoid the extra delay caused by the straggler monitoring/ detection process.  ... 
arXiv:1406.0609v3 fatcat:ept2yi2xabhs3ldn5r6heag2c4

Straggler-aware Distributed Learning: Communication Computation Latency Trade-off [article]

Emre Ozfatura, Sennur Ulukus, Deniz Gunduz
2020 arXiv   pre-print
and discarding partial computations carried out by stragglers.  ...  Imposing such a limitation results in two main drawbacks; over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as straggler/non-straggler  ...  To mitigate the stragglers each worker may perform some backup computations [5] - [7] , [21] , certain unfinished subtasks (slow workers) can be relaunched at the fast workers [10] , [11] , or some  ... 
arXiv:2004.04948v1 fatcat:vdeh3c2ibnc67m7mntwydr4qia

Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification [article]

Krishna Giri Narra, Zhifeng Lin, Ganesh Ananthanarayanan, Salman Avestimehr, Murali Annavaram
2019 arXiv   pre-print
Deploying the collage-cnn models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication based approaches while providing high  ...  Variation in inference latency can be reduced by 1.8x to 15x.  ...  Approximation techniques ignore the results from tasks running on straggler nodes.  ... 
arXiv:1904.12222v2 fatcat:gxff2w46hnbjtkdeijqjrotzva

Task-Cloning Algorithms in a MapReduce Cluster with Competitive Performance Bounds [article]

Huanle Xu, Wing Cheong Lau
2015 arXiv   pre-print
The overall elapsed time of a job, i.e. the so-called flowtime, is often dictated by one or few slowly-running tasks within a job, generally referred as the "stragglers".  ...  The cause of stragglers include tasks running on partially/intermittently failing machines or the existence of some localized resource bottleneck(s) within a MapReduce cluster.  ...  To avoid the extra delay caused by the straggler detection, cloning approach was proposed in [2] .  ... 
arXiv:1501.02330v1 fatcat:n5hoiptz7benvp3jjqg32txnvq

Combating Computational Heterogeneity in Large-Scale Distributed Computing via Work Exchange [article]

Mohamed A. Attia, Ravi Tandon
2017 arXiv   pre-print
Owing to data-intensive large-scale applications, distributed computation systems have gained significant recent interest, due to their ability of running such tasks over a large number of commodity nodes  ...  One of the major bottlenecks that adversely impacts the time efficiency is the computational heterogeneity of distributed nodes, often limiting the task completion time due to the slowest worker.  ...  One approach is to efficiently detect the stragglers while running computational tasks, and then relaunch the delayed tasks on other machines [4, 5] .  ... 
arXiv:1711.08452v1 fatcat:nezkg4w6tbdb7ajw6we5fe7bqa

Chronos: A Unifying Optimization Framework for Speculative Execution of Deadline-critical MapReduce Jobs [article]

Maotong Xu, Sultan Alamro, Tian Lan, Suresh Subramaniam
2018 arXiv   pre-print
While a number of strategies have been developed in existing work to mitigate stragglers by launching speculative or clone task attempts, none of them provides a quantitative framework that optimizes the  ...  It has been shown that the execution times of MapReduce jobs are often adversely impacted by a few slow tasks, known as stragglers, which result in high latency and deadline violations.  ...  delay for relaunching (and rejuvenating) the original attempt at a new location.  ... 
arXiv:1804.05890v1 fatcat:io5yllvmvzdodhqv6pvabdcqne

Straggler Mitigation with Tiered Gradient Codes [article]

Shanuja Sasi, V. Lalitha, Vaneet Aggarwal, B. Sundar Rajan
2019 arXiv   pre-print
Coding theoretic techniques have been proposed for synchronous Gradient Descent (GD) on multiple servers to mitigate stragglers.  ...  These techniques provide the flexibility that the job is complete when any k out of n servers finish their assigned tasks. The task size on each server is found based on the values of k and n.  ...  Both in [14] and [9] , the authors have showed that the delayed relaunch of stragglers yields significant reduction in cost and latency.  ... 
arXiv:1909.02516v1 fatcat:zwj4r44rfrcdrgo7dd3edepb3e

Adaptive Verifiable Coded Computing: Towards Fast, Secure and Private Distributed Machine Learning [article]

Tingting Tang, Ramy E. Ali, Hanieh Hashemi, Tynan Gangwani, Salman Avestimehr, Murali Annavaram
2022 arXiv   pre-print
AVCC also speeds up the conventional uncoded implementation of distributed logistic regression by up to 7.6×, and improves the test accuracy by up to 12.1%.  ...  AVCC leverages coded computing just for handling stragglers and privacy, and then uses an orthogonal approach that leverages verifiable computing to mitigate Byzantine workers.  ...  appear and then relaunch the straggling task on another node, which delays the overall  ... 
arXiv:2107.12958v2 fatcat:f4zr6cymjray3mwdcin2dsyqoi

Heterogeneous MacroTasking (HeMT) for Parallel Processing in the Public Cloud [article]

Yuquan Shan, George Kesidis, Bhuvan Urgaonkar, Jorg Schad, Jalal Khamse-Ashari, Ioannis Lambadaris
2018 arXiv   pre-print
Using tiny, equal-sized tasks (Homogeneous microTasking, HomT) has long been regarded an effective way of load balancing in parallel computing systems.  ...  In this paper, we first analyze these advantages and disadvantages of HomT.  ...  Acknowledgements: This research was supported in part by NSF CNS 1717571 grant and a Cisco Systems URP gift.  ... 
arXiv:1810.00988v1 fatcat:e6cctyx6r5bj5h7o4uu3fhvvwm