Efficient Straggler Replication in Large-scale Parallel Computing [article]

Da Wang, Gauri Joshi, Gregory Wornell
2017 arXiv   pre-print
In a cloud computing job with many parallel tasks, the tasks on the slowest machines (straggling tasks) become the bottleneck in the job completion.  ...  Despite being adopted in practice, there is little analysis of how replication affects the latency and the cost of additional computing resources.  ...  Related prior work The idea of replicating tasks in parallel computing has been recognized by system designers [10] , and first adopted at a large scale via the "backup tasks" in MapReduce [6] .  ... 
arXiv:1503.03128v3 fatcat:36xmen2rfveudmac62r3ed5oke

Discretized streams

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica
2013 Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles - SOSP '13  
D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers.  ...  Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers.  ...  This research was supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, a Google PhD Fellowship, and gifts from Amazon Web Services, Google, SAP, Cisco, Clearstory  ... 
doi:10.1145/2517349.2522737 dblp:conf/sosp/ZahariaDLHSS13 fatcat:tndskypg6rfhta755bt5q2tk3e

Parallel Performance of Molecular Dynamics Trajectory Analysis [article]

Mahzad Khoshlessan, Ioannis Paraskevakos, Geoffrey C. Fox, Shantenu Jha, Oliver Beckstein
2020 arXiv   pre-print
However, a more complicated picture emerged in which both the computation and the data ingestion exhibited close to ideal strong scaling behavior whereas stragglers were primarily caused by either large  ...  Stragglers were less prevalent for compute-bound workloads, thus pointing to file reading as a bottleneck for scaling.  ...  In the present study, we analyzed large MD trajectories in parallel with MPI and Python and observed large variations in the completion time of individual MPI ranks.  ... 
arXiv:1907.00097v3 fatcat:sxqpptlcrbgtdbx33uzcn2zbci

Proxy Responses by FPGA-Based Switch for MapReduce Stragglers

Koya MITSUZUKA, Michihiro KOIBUCHI, Hideharu AMANO, Hiroki MATSUTANI
2018 IEICE transactions on information and systems  
In parallel processing applications, a few worker nodes called "stragglers", which execute their tasks significantly slower than other tasks, increase the execution time of the job.  ...  In this paper, we propose a network switch based straggler handling system to mitigate the burden of the compute nodes.  ...  A distributed scheduling algorithm for large-scale clusters is also proposed in [6] . However, it does not discuss how to monitor the tasks in parallel in detail.  ... 
doi:10.1587/transinf.2017edp7287 fatcat:zx4pmkyq4fh5hlju5i4y2hru3m

Performance analysis of large-scale parallel-distributed processing with backup tasks for cloud computing

Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi
2013 Journal of Industrial and Management Optimization  
In cloud computing, a large-scale parallel-distributed processing service is provided where a huge task is split into a number of subtasks and those subtasks are processed on a cluster of machines called  ...  In such a processing service, a worker which takes a long time for processing a subtask makes the response time long (the issue of stragglers).  ...  In this paper, we considered the efficiency of backup-task scheduling in a large-scale parallel-distributed processing.  ... 
doi:10.3934/jimo.2014.10.113 fatcat:erm5xikryrerbmaxgokkfqo6q4

Mitigate data skew caused stragglers through ImKP partition in MapReduce

Xue Ouyang, Huan Zhou, Stephen Clement, Paul Townend, Jie Xu
2017 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)  
Although proved to be effective for contention caused stragglers, speculative execution can easily meet its bottleneck when mitigating data skew caused stragglers due to its replication nature: the identical  ...  The Map inputs are typically even in size according to the HDFS block configuration, therefore the skew caused stragglers happen mainly in the Reduce phase because of the unknown intermediate key distribution  ...  INTRODUCTION The MapReduce framework proposed in 2008 [1] has now become the de facto platform to support large-scale parallel processing and data analytics in production systems.  ... 
doi:10.1109/pccc.2017.8280475 dblp:conf/ipccc/OuyangZCTX17 fatcat:ory4bvmpofdpfnzxnjcjumut2i

Collaborative Learning Based Straggler Prevention in Large-Scale Distributed Computing Framework

Shyam Deshmukh, Komati Thirupathi Rao, Mohammad Shabaz, Manjit Kaur
2021 Security and Communication Networks  
It performs rapid processing of tasks by subdividing them into tasks that execute in parallel.  ...  how to efficiently deal with mitigating stragglers without moving data to a centralized location.  ...  Spark is designed to efficiently scale up from one-to-many thousands of compute nodes.  ... 
doi:10.1155/2021/8340925 fatcat:7fe5onujrjbaveoykoqrcux7uu

Mitigating stragglers to avoid QoS violation for time-critical applications through dynamic server blacklisting

Xue Ouyang, Changjian Wang, Jie Xu
2019 Future Generation Computer Systems  
The optimal k is investigated as a trade-off between capacity loss and straggler mitigation efficiency.  ...  As a result, no new tasks/replications are assigned to those straggler-prone nodes within the following time window.  ...  World Systems Stragglers are intensively discussed within the MapReduce [5] background as it is the most prominent parallel computing framework for processing large data sets within massive-scale clusters  ... 
doi:10.1016/j.future.2019.07.017 fatcat:fl7ojzrhtjcynbzkn52cgrizgu

Train Where the Data is: A Case for Bandwidth Efficient Coded Training [article]

Zhifeng Lin, Krishna Giri Narra, Mingchao Yu, Salman Avestimehr, Murali Annavaram
2019 arXiv   pre-print
In this paper, we tackle the uncertainty in distributed mobile training using a bandwidth-efficient encoding strategy.  ...  One proactive approach to tolerate computational uncertainty is to store data in a coded format and perform training on coded data.  ...  This strategy is known as coded computing. In coded computing, redundancy is added in an efficient coded form to make the computations robust to stragglers.  ... 
arXiv:1910.10283v1 fatcat:5u5evimcbjemrancv2bl6uhiby

Timely Long Tail Identification through Agent Based Monitoring and Analytics

Peter Garraghan, Xue Ouyang, Paul Townend, Jie Xu
2015 2015 IEEE 18th International Symposium on Real-Time Distributed Computing  
practice and enables far more effective mitigation strategies in large-scale distributed systems worldwide.  ...  The increasing complexity and scale of distributed systems has resulted in the manifestation of emergent behavior which substantially affects overall system performance.  ...  Acknowledgments The work is supported in part by the National Basic Research Program of China (973) (No.2011CB302602), the U.K. EPSRC WRG platform project (No.  ... 
doi:10.1109/isorc.2015.39 dblp:conf/isorc/GarraghanOTX15 fatcat:siudzi5ywrbuxc5sg6cyjevkeu

Scaling Neural Machine Translation [article]

Myle Ott, Sergey Edunov, David Grangier, Michael Auli
2018 arXiv   pre-print
On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.  ...  Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine.  ...  Training with large batches is less data-efficient, but can be parallelized. Batch sizes given in number of target tokens excluding padding.  ... 
arXiv:1806.00187v3 fatcat:pbm53ks7hvgwzfp6qf3fluqwku

Tolhit – A Scheduling Algorithm for Hadoop Cluster

M. Brahmwar, M. Kumar, G. Sikka
2016 Procedia Computer Science  
Apache Hadoop is the most prominent implementation of MapReduce, which is used for processing and analyses of such large scale data intensive applications in a highly scalable and fault tolerant manner  ...  In this work, a new scheme is introduced to aid the scheduler in identifying the nodes on which stragglers can be executed.  ...  In order to solve this crucial problem of analyzing such large data sets various computing paradigms such as grid computing and cloud computing came into existence.  ... 
doi:10.1016/j.procs.2016.06.043 fatcat:mpa7x4kifzdhzaiqnoqijcj6ii

Scaling Neural Machine Translation

Myle Ott, Sergey Edunov, David Grangier, Michael Auli
2018 Proceedings of the Third Conference on Machine Translation: Research Papers  
On the WMT'14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.  ...  Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine.  ...  Training with large batches is less data-efficient, but can be parallelized. Batch sizes given in number of target tokens excluding padding.  ... 
doi:10.18653/v1/w18-6301 dblp:conf/wmt/OttEGA18 fatcat:av5idbulwjatrcmfj5njd4crci

Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks

Wei Chen, Jia Rao, Xiaobo Zhou
2017 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
MapReduce applications, which require access to a large number of computing nodes, are commonly deployed in heterogeneous environments.  ...  The performance discrepancy between individual nodes in a heterogeneous cluster present significant challenges to attain good performance in MapReduce jobs.  ...  This research was supported in part by U.S. NSF grants CNS-1422119, CNS-1649502 and IIS-1633753.  ... 
doi:10.1109/ipdps.2017.28 dblp:conf/ipps/ChenRZ17 fatcat:7alnu3wslbeo3eokaogftqvtoa

Client-side Straggler-Aware I/O Scheduler for Object-based Parallel File Systems [article]

Neda Tavakoli, Dong Dai, Yong Chen
2018 arXiv   pre-print
Object-based parallel file systems have emerged as promising storage solutions for high-performance computing (HPC) systems.  ...  An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput.  ...  The storage server straggler problem can be catastrophic in projected extreme-scale systems, as the large-scale storage system significantly increases the possibility of the existence of a straggler in  ... 
arXiv:1805.06156v1 fatcat:tdzlqzp7hbekjlpjpsfutexjhy