Efficient Straggler Replication in Large-scale Parallel Computing
[article]
2017
arXiv
pre-print
In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in the job completion. ...
Despite being adopted in practice, there is little analysis of how replication affects the latency and the cost of additional computing resources. ...
Related prior work: The idea of replicating tasks in parallel computing has been recognized by system designers [10], and first adopted at a large scale via the "backup tasks" in MapReduce [6]. ...
arXiv:1503.03128v3
fatcat:36xmen2rfveudmac62r3ed5oke
Discretized streams
2013
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles - SOSP '13
D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. ...
Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. ...
This research was supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, a Google PhD Fellowship, and gifts from Amazon Web Services, Google, SAP, Cisco, Clearstory ...
doi:10.1145/2517349.2522737
dblp:conf/sosp/ZahariaDLHSS13
fatcat:tndskypg6rfhta755bt5q2tk3e
Parallel Performance of Molecular Dynamics Trajectory Analysis
[article]
2020
arXiv
pre-print
However, a more complicated picture emerged in which both the computation and the data ingestion exhibited close to ideal strong scaling behavior whereas stragglers were primarily caused by either large ...
Stragglers were less prevalent for compute-bound workloads, thus pointing to file reading as a bottleneck for scaling. ...
In the present study, we analyzed large MD trajectories in parallel with MPI and Python and observed large variations in the completion time of individual MPI ranks. ...
arXiv:1907.00097v3
fatcat:sxqpptlcrbgtdbx33uzcn2zbci
Proxy Responses by FPGA-Based Switch for MapReduce Stragglers
2018
IEICE transactions on information and systems
In parallel processing applications, a few worker nodes called "stragglers", which execute their tasks significantly slower than other tasks, increase the execution time of the job. ...
In this paper, we propose a network switch based straggler handling system to mitigate the burden of the compute nodes. ...
A distributed scheduling algorithm for large-scale clusters is also proposed in [6]. However, it does not discuss in detail how to monitor the tasks in parallel. ...
doi:10.1587/transinf.2017edp7287
fatcat:zx4pmkyq4fh5hlju5i4y2hru3m
Performance analysis of large-scale parallel-distributed processing with backup tasks for cloud computing
2013
Journal of Industrial and Management Optimization
In cloud computing, a large-scale parallel-distributed processing service is provided where a huge task is split into a number of subtasks and those subtasks are processed on a cluster of machines called ...
In such a processing service, a worker which takes a long time for processing a subtask makes the response time long (the issue of stragglers). ...
In this paper, we considered the efficiency of backup-task scheduling in a large-scale parallel-distributed processing. ...
doi:10.3934/jimo.2014.10.113
fatcat:erm5xikryrerbmaxgokkfqo6q4
Mitigate data skew caused stragglers through ImKP partition in MapReduce
2017
2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)
Although proved to be effective for contention caused stragglers, speculative execution can easily meet its bottleneck when mitigating data skew caused stragglers due to its replication nature: the identical ...
The Map inputs are typically even in size according to the HDFS block configuration, therefore the skew caused stragglers happen mainly in the Reduce phase because of the unknown intermediate key distribution ...
Introduction: The MapReduce framework proposed in 2008 [1] has now become the de facto platform to support large-scale parallel processing and data analytics in production systems. ...
doi:10.1109/pccc.2017.8280475
dblp:conf/ipccc/OuyangZCTX17
fatcat:ory4bvmpofdpfnzxnjcjumut2i
Collaborative Learning Based Straggler Prevention in Large-Scale Distributed Computing Framework
2021
Security and Communication Networks
It performs rapid processing of tasks by subdividing them into tasks that execute in parallel. ...
how to efficiently deal with mitigating stragglers without moving data to a centralized location. ...
Spark is designed to scale efficiently from one to many thousands of compute nodes. ...
doi:10.1155/2021/8340925
fatcat:7fe5onujrjbaveoykoqrcux7uu
Mitigating stragglers to avoid QoS violation for time-critical applications through dynamic server blacklisting
2019
Future generations computer systems
The optimal k is investigated as a trade-off between capacity loss and straggler mitigation efficiency. ...
As a result, no new tasks/replications are assigned to those straggler-prone nodes within the following time window. ...
World Systems Stragglers are intensively discussed within the MapReduce [5] background as it is the most prominent parallel computing framework for processing large data sets within massive-scale clusters ...
doi:10.1016/j.future.2019.07.017
fatcat:fl7ojzrhtjcynbzkn52cgrizgu
Train Where the Data is: A Case for Bandwidth Efficient Coded Training
[article]
2019
arXiv
pre-print
In this paper, we tackle the uncertainty in distributed mobile training using a bandwidth-efficient encoding strategy. ...
One proactive approach to tolerate computational uncertainty is to store data in a coded format and perform training on coded data. ...
This strategy is known as coded computing. In coded computing, redundancy is added in an efficient coded form to make the computations robust to stragglers. ...
arXiv:1910.10283v1
fatcat:5u5evimcbjemrancv2bl6uhiby
Timely Long Tail Identification through Agent Based Monitoring and Analytics
2015
2015 IEEE 18th International Symposium on Real-Time Distributed Computing
practice and enables far more effective mitigation strategies in large-scale distributed systems worldwide. ...
The increasing complexity and scale of distributed systems has resulted in the manifestation of emergent behavior which substantially affects overall system performance. ...
Acknowledgments: The work is supported in part by the National Basic Research Program of China (973) (No.2011CB302602), the U.K. EPSRC WRG platform project (No. ...
doi:10.1109/isorc.2015.39
dblp:conf/isorc/GarraghanOTX15
fatcat:siudzi5ywrbuxc5sg6cyjevkeu
Scaling Neural Machine Translation
[article]
2018
arXiv
pre-print
On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs. ...
Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. ...
Training with large batches is less data-efficient, but can be parallelized. Batch sizes given in number of target tokens excluding padding. ...
arXiv:1806.00187v3
fatcat:pbm53ks7hvgwzfp6qf3fluqwku
Tolhit – A Scheduling Algorithm for Hadoop Cluster
2016
Procedia Computer Science
Apache Hadoop is the most prominent implementation of MapReduce, which is used for processing and analyses of such large scale data intensive applications in a highly scalable and fault tolerant manner ...
In this work, a new scheme is introduced to aid the scheduler in identifying the nodes on which stragglers can be executed. ...
In order to solve this crucial problem of analyzing such large data sets various computing paradigms such as grid computing and cloud computing came into existence. ...
doi:10.1016/j.procs.2016.06.043
fatcat:mpa7x4kifzdhzaiqnoqijcj6ii
Scaling Neural Machine Translation
2018
Proceedings of the Third Conference on Machine Translation: Research Papers
On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs. ...
Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. ...
Training with large batches is less data-efficient, but can be parallelized. Batch sizes given in number of target tokens excluding padding. ...
doi:10.18653/v1/w18-6301
dblp:conf/wmt/OttEGA18
fatcat:av5idbulwjatrcmfj5njd4crci
Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks
2017
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
MapReduce applications, which require access to a large number of computing nodes, are commonly deployed in heterogeneous environments. ...
The performance discrepancy between individual nodes in a heterogeneous cluster presents significant challenges to attaining good performance in MapReduce jobs. ...
This research was supported in part by U.S. NSF grants CNS-1422119, CNS-1649502 and IIS-1633753. ...
doi:10.1109/ipdps.2017.28
dblp:conf/ipps/ChenRZ17
fatcat:7alnu3wslbeo3eokaogftqvtoa
Client-side Straggler-Aware I/O Scheduler for Object-based Parallel File Systems
[article]
2018
arXiv
pre-print
Object-based parallel file systems have emerged as promising storage solutions for high-performance computing (HPC) systems. ...
An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput. ...
The storage server straggler problem can be catastrophic in projected extreme-scale systems, as the large-scale storage system significantly increases the possibility of the existence of a straggler in ...
arXiv:1805.06156v1
fatcat:tdzlqzp7hbekjlpjpsfutexjhy