Straggler Mitigation by Delayed Relaunch of Tasks
[article]
2017
arXiv
pre-print
We show that delaying redundancy is not effective in reducing cost and that delayed relaunch of stragglers can yield significant reduction in cost and latency. ...
We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks. ...
Untimely relaunch may cause reduction in gain or even pain by either late relaunch and delayed cancellation of stragglers or early relaunch and killing non-stragglers. ...
arXiv:1710.00414v1
fatcat:ilwh4i5s5vgtbofvt7fr3t7x7u
Straggler Mitigation at Scale
[article]
2019
arXiv
pre-print
This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure coded redundancy. ...
3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together? ...
REDUNDANCY TOGETHER WITH RELAUNCH In this section, we consider employing redundant tasks and straggler relaunch jointly for straggler mitigation.
A. ...
arXiv:1906.10664v2
fatcat:t62geuwic5bcjjhiffc5dft4mi
NURD: Negative-Unlabeled Learning for Online Datacenter Straggler Prediction
[article]
2022
arXiv
pre-print
Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job. ...
Within a running job, however, none of this information is available until stragglers have revealed themselves when they have already delayed the job. ...
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies. ...
arXiv:2203.08339v1
fatcat:2wpyuyd2grd5bd6qwtgy7chrde
Wrangler: Predictable and Faster Jobs using Fewer Resources
2014
In particular, by using these predictions to balance delay in task scheduling against the potential for idling of resources, Wrangler achieves a speed up in the overall job completion time. ...
Existing straggler mitigation techniques are inefficient due to their reactive and replicative nature: they rely on a wait-speculate-reexecute mechanism, thus leading to delayed straggler detection and ...
This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and ...
doi:10.1145/2670979.2671005
dblp:conf/cloud/YadwadkarAK14
fatcat:rfwiymfinrdj3iq4446ogppvee
Slack Squeeze Coded Computing for Adaptive Straggler Mitigation
[article]
2019
arXiv
pre-print
Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. ...
We implement an LSTM-based speed prediction algorithm to predict speeds of compute nodes. ...
This material is based upon work supported by Defense Advanced Research Projects Agency (DARPA) under Contract No. ...
arXiv:1904.07098v2
fatcat:6mj2ox3a5jezlpl2zt2vweehfi
Straggler-Aware Distributed Learning: Communication–Computation Latency Trade-Off
2020
Entropy
stragglers. ...
Imposing such a limitation results in two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding partial computations carried out by ...
Author Contributions: Conceptualization
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/e22050544
pmid:33286316
pmcid:PMC7517046
fatcat:uw4mzulavvhbrh7cctqlitke5u
Optimization for Speculative Execution of Multiple Jobs in a MapReduce-like Cluster
[article]
2015
arXiv
pre-print
A parallel processing job can be delayed substantially as long as one of its many tasks is being assigned to a failing machine. ...
We show that the ESE algorithm can beat the Mantri baseline scheme by 18% in terms of job flowtime while consuming the same amount of resource. ...
Recently, [2] proposes to mitigate the straggler problem by cloning every small job, avoiding the extra delay caused by the straggler monitoring/detection process. ...
arXiv:1406.0609v3
fatcat:ept2yi2xabhs3ldn5r6heag2c4
Straggler-aware Distributed Learning: Communication Computation Latency Trade-off
[article]
2020
arXiv
pre-print
and discarding partial computations carried out by stragglers. ...
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as straggler/non-straggler ...
To mitigate the stragglers each worker may perform some backup computations [5]-[7], [21], certain unfinished subtasks (slow workers) can be relaunched at the fast workers [10], [11], or some ...
arXiv:2004.04948v1
fatcat:vdeh3c2ibnc67m7mntwydr4qia
Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification
[article]
2019
arXiv
pre-print
Deploying the collage-cnn models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication based approaches while providing high ...
Variation in inference latency can be reduced by 1.8x to 15x. ...
Approximation techniques ignore the results from tasks running on straggler nodes. ...
arXiv:1904.12222v2
fatcat:gxff2w46hnbjtkdeijqjrotzva
Task-Cloning Algorithms in a MapReduce Cluster with Competitive Performance Bounds
[article]
2015
arXiv
pre-print
The overall elapsed time of a job, i.e., the so-called flowtime, is often dictated by one or a few slow-running tasks within the job, generally referred to as "stragglers". ...
The causes of stragglers include tasks running on partially/intermittently failing machines or the existence of some localized resource bottleneck(s) within a MapReduce cluster. ...
To avoid the extra delay caused by straggler detection, a cloning approach was proposed in [2]. ...
arXiv:1501.02330v1
fatcat:n5hoiptz7benvp3jjqg32txnvq
Combating Computational Heterogeneity in Large-Scale Distributed Computing via Work Exchange
[article]
2017
arXiv
pre-print
Owing to data-intensive large-scale applications, distributed computation systems have gained significant recent interest, due to their ability to run such tasks over a large number of commodity nodes ...
One of the major bottlenecks that adversely impacts the time efficiency is the computational heterogeneity of distributed nodes, often limiting the task completion time due to the slowest worker. ...
One approach is to efficiently detect the stragglers while running computational tasks, and then relaunch the delayed tasks on other machines [4, 5]. ...
arXiv:1711.08452v1
fatcat:nezkg4w6tbdb7ajw6we5fe7bqa
Chronos: A Unifying Optimization Framework for Speculative Execution of Deadline-critical MapReduce Jobs
[article]
2018
arXiv
pre-print
While a number of strategies have been developed in existing work to mitigate stragglers by launching speculative or clone task attempts, none of them provides a quantitative framework that optimizes the ...
It has been shown that the execution times of MapReduce jobs are often adversely impacted by a few slow tasks, known as stragglers, which result in high latency and deadline violations. ...
delay for relaunching (and rejuvenating) the original attempt at a new location. ...
arXiv:1804.05890v1
fatcat:io5yllvmvzdodhqv6pvabdcqne
Straggler Mitigation with Tiered Gradient Codes
[article]
2019
arXiv
pre-print
Coding theoretic techniques have been proposed for synchronous Gradient Descent (GD) on multiple servers to mitigate stragglers. ...
These techniques provide the flexibility that the job is complete when any k out of n servers finish their assigned tasks. The task size on each server is found based on the values of k and n. ...
In both [14] and [9], the authors showed that the delayed relaunch of stragglers yields a significant reduction in cost and latency. ...
arXiv:1909.02516v1
fatcat:zwj4r44rfrcdrgo7dd3edepb3e
Adaptive Verifiable Coded Computing: Towards Fast, Secure and Private Distributed Machine Learning
[article]
2022
arXiv
pre-print
AVCC also speeds up the conventional uncoded implementation of distributed logistic regression by up to 7.6×, and improves the test accuracy by up to 12.1%. ...
AVCC leverages coded computing just for handling stragglers and privacy, and then uses an orthogonal approach that leverages verifiable computing to mitigate Byzantine workers. ...
appear and then relaunch the straggling task on another node, which delays the overall ...
learning, straggler mitigation, Byzantine robustness, privacy
arXiv:2107.12958v2
fatcat:f4zr6cymjray3mwdcin2dsyqoi
Heterogeneous MacroTasking (HeMT) for Parallel Processing in the Public Cloud
[article]
2018
arXiv
pre-print
Using tiny, equal-sized tasks (Homogeneous microTasking, HomT) has long been regarded an effective way of load balancing in parallel computing systems. ...
In this paper, we first analyze these advantages and disadvantages of HomT. ...
Acknowledgements: This research was supported in part by NSF CNS 1717571 grant and a Cisco Systems URP gift. ...
arXiv:1810.00988v1
fatcat:e6cctyx6r5bj5h7o4uu3fhvvwm