19 Hits in 4.0 sec

Serverless Straggler Mitigation using Local Error-Correcting Codes [article]

Vipul Gupta, Dominic Carrano, Yaoqing Yang, Vaishaal Shankar, Thomas Courtade, Kannan Ramchandran
2020 arXiv pre-print
The proposed schemes are inspired by error-correcting codes and employ parallel encoding and decoding over the data stored in the cloud using serverless workers.  ...  We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication and evaluate them on several common applications from machine learning  ...  STRAGGLER RESILIENCE IN SERVERLESS COMPUTING USING CODES A.  ... 
arXiv:2001.07490v1 fatcat:ptbzh4ld3jezphosqkylgpadni

OverSketched Newton: Fast Convex Optimization for Serverless Systems [article]

Vipul Gupta, Swanand Kadhe, Thomas Courtade, Michael W. Mahoney, Kannan Ramchandran
2020 arXiv pre-print
These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures.  ...  Depending on whether the problem is strongly convex or not, we propose different iteration updates using the approximate Hessian.  ...  For straggler mitigation during gradient calculation, we use the recently proposed technique based on error-correcting codes to create redundant computation [33] [34] [35] .  ... 
arXiv:1903.08857v3 fatcat:luuf7mcdm5dk3l4bhpmf4khmq4

SeBS: A Serverless Benchmark Suite for Function-as-a-Service Computing [article]

Marcin Copik, Grzegorz Kwasniewski, Maciej Besta, Michal Podstawski, Torsten Hoefler
2021 arXiv pre-print
However, the quickly moving technology hinders reproducibility, and the lack of a standardized benchmarking suite leads to ad-hoc solutions and microbenchmarks being used in serverless research, further  ...  To address this challenge, we propose the Serverless Benchmark Suite: the first benchmark for FaaS computing that systematically covers a wide spectrum of cloud resources and applications.  ...  Serverless functions pose new challenges due to a lack of code and data locality.  ... 
arXiv:2012.14132v2 fatcat:i6hehq4zqfht3f6ovuljvkzvsu

numpywren: serverless linear algebra [article]

Vaishaal Shankar, Karl Krauth, Qifan Pu, Eric Jonas, Shivaram Venkataraman, Ion Stoica, Benjamin Recht, Jonathan Ragan-Kelley
2018 arXiv pre-print
At the same time, we show that the inability of serverless runtimes to exploit locality across the cores in a machine fundamentally limits their network efficiency, which limits performance on other algorithms  ...  Linear algebra operations are widely used in scientific computing and machine learning applications.  ...  Straggler Mitigation: The lease mechanism also enables straggler mitigation by default. If a worker stalls or is slow, it can fail to renew a lease before it expires.  ... 
arXiv:1810.09679v1 fatcat:lxv65s42rzd6vomkg2pff2akhm

Cloudburst: Stateful Functions-as-a-Service [article]

Vikram Sreekanti, Chenggang Wu, Xiayue Charles Lin, Johann Schleier-Smith, Jose M. Faleiro, Joseph E. Gonzalez, Joseph M. Hellerstein, Alexey Tumanov
2020 arXiv pre-print
Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality.  ...  Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular.  ...  If so, the cache returns the local version; otherwise, it queries the upstream cache for the correct version snapshot.  ... 
arXiv:2001.04592v2 fatcat:txil6f7nprcezlmc45qscw7voe

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication [article]

Ankur Mallick, Malhar Chaudhari, Utsav Sheth, Ganesh Palanikumar, Gauri Joshi
2019 arXiv pre-print
We conduct experiments in three computing environments: local parallel computing, Amazon EC2, and Amazon Lambda, which show that rateless coding gives as much as 3× speed-up over uncoded schemes.  ...  Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they ignore partial work done by straggling nodes thus resulting in a lot of redundant computation.  ...  The rows of $A$ are encoded using an error correcting code to give the $m_e \times n$ encoded matrix $A_e$, where $m_e \ge m$. We denote the amount of redundancy added by the parameter $\alpha = m_e/m$.  ... 
arXiv:1804.10331v5 fatcat:46a5afq3bfcazas72bdrnehbwi

Distributed Averaging Methods for Randomized Second Order Optimization [article]

Burak Bartan, Mert Pilanci
2020 arXiv pre-print
Additionally, we demonstrate the implications of our theoretical findings via large scale experiments performed on a serverless computing platform.  ...  A possible solution to the issue of straggling nodes is to use error correcting codes to insert redundancy to computation and hence to avoid waiting for the outputs of all of the worker nodes (Lee et  ...  We identify that implementing straggler mitigation for solving large scale problems via approximate second order optimization methods such as distributed Newton sketch is a promising direction.  ... 
arXiv:2002.06540v1 fatcat:d4jwvyj5s5gk7bpymr6tj7urca

Optimizing Prediction Serving on Low-Latency Serverless Dataflow [article]

Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, Joseph M. Hellerstein
2020 arXiv pre-print
We present the design of Cloudflow, a system that provides this API and realizes it on an autoscaling serverless backend.  ...  Similar to straggler mitigation in MapReduce [11], competitive execution of inference has been shown to improve tail latencies for ML models [36, 53]. Operator Autoscaling and Placement.  ...  Similar to our Sagemaker deployments, we use custom code to move each request through the pipeline.  ... 
arXiv:2007.05832v1 fatcat:m3q5zhbfsbgcdijtm7stgyko7u

Distributed Sketching for Randomized Optimization: Exact Characterization, Concentration and Lower Bounds [article]

Burak Bartan, Mert Pilanci
2022 arXiv pre-print
Additionally, we demonstrate the implications of our theoretical findings via large scale experiments on a serverless cloud computing platform.  ...  We leverage randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems.  ...  A possible solution to the issue of straggling nodes is to use error correcting codes to insert redundancy to computation and hence to avoid waiting for the outputs of all of the worker nodes [41],  ... 
arXiv:2203.09755v1 fatcat:7vwr4neazfg7higpowlhg2ztr4

Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads [article]

John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, Guoqing Harry Xu
2021 arXiv pre-print
Just using Lambdas on top of CPU servers offers up to 2.75x more performance-per-dollar than training only with CPU servers.  ...  Uniquely, Dorylus can take advantage of serverless computing to increase scalability at a low cost. The key insight guiding our design is computation separation.  ...  Compiling the Code. To build and synchronize the code on all nodes in the cluster run: local$ ./gnnman/setup-cluster local$ .  ... 
arXiv:2105.11118v2 fatcat:qzflntrg5bbqpfcmbu4xh6dpne

Autonomy and Intelligence in the Computing Continuum: Challenges, Enablers, and Future Directions for Orchestration [article]

Henna Kokkonen, Lauri Lovén, Naser Hossein Motlagh, Juha Partala, Alfonso González-Gil, Ester Sola, Iñigo Angulo, Madhusanka Liyanage, Teemu Leppänen, Tri Nguyen, Panos Kostakos, Mehdi Bennis (+4 others)
2022 arXiv pre-print
In this article, we study orchestration in the device-edge-cloud continuum, and focus on AI for edge, that is, the AI methods used in resource orchestration.  ...  We claim that to support the constantly growing requirements of intelligent applications in the device-edge-cloud computing continuum, resource orchestration needs to embrace edge AI and emphasize local  ...  Consequently, MAML is limited to using very simple, shallow ANN architectures. To mitigate this issue, Rusu et al.  ... 
arXiv:2205.01423v1 fatcat:ul24ibzts5eutoozdh4y3n5kgq

Multi-tenant mobile offloading systems for real-time computer vision applications

Zhou Fang, Jeng-Hau Lin, Mani B. Srivastava, Rajesh K. Gupta
2019 Proceedings of the 20th International Conference on Distributed Computing and Networking - ICDCN '19
In pursuit of low latency and high throughput, Clipper adopts caching, adaptive batching, and straggler mitigation techniques.  ...  It also presents straggler mitigation techniques to reduce the tail delay. These works do not consider co-locating RT and non-RT tasks, which is addressed in our work.  ...  Implementing run method for object detection microservice using data parallelization OBJECT DET PARALLEL. # A microservice for object detection using data parallelization.  ... 
doi:10.1145/3288599.3288634 dblp:conf/icdcn/FangLS019 fatcat:qpib2wkm7jdnfg7k64eh3hwje4

Hints and Principles for Computer System Design [article]

Butler Lampson
2021 arXiv pre-print
Two years later we built the Alto 2, using 4k RAM chips and error correction.  ...  If is too big (perhaps because the chance of corrupting a message bit is too great), you can make it smaller with forward error correction (an error-correcting code).  ... 
arXiv:2011.02455v3 fatcat:jolyz5lknjdbpjpxjcrx5rh6fa

Triggerflow: Trigger-based Orchestration of Serverless Workflows [article]

Pedro García-López, Aitor Arjona, Josep Sampe, Aleksander Slominski, Lionel Villard
2020 pre-print
We demonstrate that Triggerflow is a novel serverless building block capable of constructing different reactive schedulers (State Machines, Directed Acyclic Graphs, Workflow as code).  ...  We present Triggerflow: an extensible Trigger-based Orchestration architecture for serverless workflows built on top of Knative Eventing and Kubernetes technologies.  ...  Moreover, to use our event sourcing version of PyWren, it is not required any change in the user's code. This means that the code is completely portable between the local-machine and the Cloud, so users  ... 
doi:10.1145/3401025.3401731 arXiv:2006.08654v1 fatcat:b5onuunnvfbnrmy3lyx36hh35i

Improving the Efficiency of Heterogeneous Clouds

Michael Kaufmann
A mitigation strategy exists with task preemption, that is also used for straggler mitigation.  ...  Straggler mitigation. Similar to the stale synchronous parallel (SSP) approach described in Section, uni-tasks exploits the auto-correcting property of ML training algorithms to mitigate the impact  ...  Both variants use the same local solver algorithm (SGD), hence lSGD with H=1 is mSGD. The number of training samples processed per iteration is K × H × L in both cases.  ... 
doi:10.5445/ir/1000117451 fatcat:ui4lxnlonjculcs7qimbo2dena
Showing results 1–15 out of 19 results