
Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination [article]

Fuxun Yu, Zirui Xu, Tong Shen, Dimitrios Stamoulis, Longfei Shangguan, Di Wang, Rishi Madhok, Chunshui Zhao, Xin Li, Nikolaos Karianakis, Dimitrios Lymberopoulos, Ang Li (+3 others)
2020 arXiv   pre-print
Motivated by this, we propose a GPU runtime-aware DNN optimization methodology to eliminate such GPU tail effect adaptively on GPU platforms.  ...  In this work, we show that the mismatch between the varied DNN computation workloads and GPU capacity can cause the idle GPU tail effect, leading to GPU under-utilization and low throughput.  ...  such a static indicator into dynamic operations in terms of runtime latency.  ... 
arXiv:2011.03897v2 fatcat:zruq2wonqzhkjhsexrd6qrytty
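The idle-tail phenomenon this entry describes follows from simple wave arithmetic: a kernel's thread blocks execute in waves across the streaming multiprocessors (SMs), and a partially filled last wave leaves SMs idle. A minimal Python sketch of that calculation (generic GPU arithmetic, not the paper's methodology; the blocks_per_sm parameter is a simplifying assumption):

import math

def tail_utilization(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> float:
    """Fraction of the last wave's SM slots that do useful work."""
    slots_per_wave = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / slots_per_wave)
    used_in_last_wave = num_blocks - (waves - 1) * slots_per_wave
    return used_in_last_wave / slots_per_wave

# e.g. 100 blocks on 80 SMs: wave 1 is full, wave 2 uses only 20/80 slots.
print(tail_utilization(100, 80))  # 0.25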

Towards QoS-Aware and Resource-Efficient GPU Microservices Based on Spatial Multitasking GPUs In Datacenters [article]

Wei Zhang, Quan Chen, Kaihua Fu, Ningxin Zheng, Zhiyi Huang, Jingwen Leng, Chao Li, Wenli Zheng, Minyi Guo
2020 arXiv   pre-print
The two policies consider the microservice pipeline effect and the runtime GPU resource contention when allocating resources for the microservices.  ...  Compared with state-of-the-art work, Camelot increases the supported peak load by up to 64.5% with limited GPUs, and reduces resource usage by 35% at low load while achieving the desired 99%-ile latency target  ...  The unstable runtime contention behavior results in long tail latency.  ... 
arXiv:2005.02088v1 fatcat:ehnexhl4wrchjbpeligidyocru
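As rough intuition for latency-target-driven allocation in a microservice pipeline, here is a toy sketch under strong assumptions (stage latency scales inversely with its GPU share; no contention modeled). It is not Camelot's two policies, only the general shape of the problem:

def pipeline_latency(stages, alloc):
    # Latency model: each stage's latency grows inversely with its GPU share.
    return sum(lat * 100.0 / alloc[name] for name, lat in stages.items())

def allocate(stages, target_ms, step=5):
    """stages: {name: latency in ms at 100% of the GPU} -> {name: gpu %}."""
    alloc = {name: 100 for name in stages}
    shrunk = True
    while shrunk:
        shrunk = False
        for name in stages:
            if alloc[name] > step:
                alloc[name] -= step
                if pipeline_latency(stages, alloc) <= target_ms:
                    shrunk = True
                else:
                    alloc[name] += step  # revert: target would be missed
    return alloc

print(allocate({"decode": 2.0, "detect": 8.0}, target_ms=20.0))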

Multi-tenant mobile offloading systems for real-time computer vision applications

Zhou Fang, Jeng-Hau Lin, Mani B. Srivastava, Rajesh K. Gupta
2019 Proceedings of the 20th International Conference on Distributed Computing and Networking - ICDCN '19  
Optimization techniques to improve the runtime efficiency of DNN inference tasks are explored as well, such as data parallelization and adaptive batching in the inference phase of DNN models.  ...  Our work (see Chapter 3) provides APIs to program such complex applications, with infrastructure support to optimize DNN inference and scheduling methods to co-locate RT and non-RT tasks on shared GPUs  ...  For example, the Python class that implements an object detection microservice includes an initialization method to load a DNN model (dnn_model) from a configuration file (configs), and a run method to  ... 
doi:10.1145/3288599.3288634 dblp:conf/icdcn/FangLS019 fatcat:qpib2wkm7jdnfg7k64eh3hwje4
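The excerpt is concrete enough to sketch. In the reconstruction below, the class name, load_model helper, and model_path key are hypothetical; only the dnn_model field, the configs file, and the run method come from the excerpt:

import json

def load_model(path):
    # Stand-in for a real framework loader (e.g. a TensorFlow/PyTorch load);
    # returns a callable "model" for illustration only.
    return lambda frame: {"model": path, "detections": []}

class ObjectDetectionService:
    def __init__(self, configs):
        # Initialization: read the configuration file and load the DNN model.
        with open(configs) as f:
            cfg = json.load(f)
        self.dnn_model = load_model(cfg["model_path"])

    def run(self, frame):
        # Serve one inference request on an input frame.
        return self.dnn_model(frame)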

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision [article]

Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
2022 arXiv   pre-print
A more detailed summary, with the surveyed papers and code links, can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers  ...  The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into GPU datacenters.  ...  of inference executions with tail latency or failures.  ... 
arXiv:2205.11913v3 fatcat:fnbinueyijb4nc75fpzd6hzjgq

Equinox: Training (for Free) on a Custom Inference Accelerator

Mario Drumond, Louis Coulon, Arash Pourhabibi, Ahmet Caner Yüzügüler, Babak Falsafi, Martin Jaggi
2021 MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture  
...  inference services' tail latency goals.  ...  We show that relaxing latency constraints in an inference accelerator with ALU arrays that are batching-optimized achieves near-optimal throughput for a given area and power envelope while maintaining  ...  We would also like to thank Simla Harma for conducting HBFP convergence experiments and sharing the results with us.  ... 
doi:10.1145/3466752.3480057 fatcat:4m3kivurp5c6lcvrlfdcr5qmly
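The throughput/latency trade-off the excerpt alludes to reduces to picking the largest batch whose service latency still fits the latency budget. A toy sketch (the linear latency model and its constants are made up for illustration):

def pick_batch(batch_sizes, latency_ms, budget_ms):
    """batch_sizes: candidate sizes; latency_ms: size -> estimated latency."""
    feasible = [b for b in batch_sizes if latency_ms(b) <= budget_ms]
    return max(feasible) if feasible else min(batch_sizes)

# Toy latency model: fixed overhead plus a per-item cost.
est = lambda b: 1.0 + 0.5 * b
print(pick_batch([1, 2, 4, 8, 16], est, budget_ms=6.0))  # -> 8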

Machine Learning for Microcontroller-Class Hardware – A Review [article]

Swapnil Sayan Saha, Sandeep Singh Sandha, Mani Srivastava
2022 arXiv   pre-print
Researchers use a specialized model development workflow for resource-limited applications to ensure the compute and latency budget is within the device limits while still maintaining the desired performance  ...  Finally, we identify the open research challenges and unsolved questions demanding careful consideration moving forward.  ...  Loop unrolling helps eliminate branch penalties and hide memory access latencies [130].  ... 
arXiv:2205.14550v3 fatcat:y272riitirhwfgfiotlwv5i7nu
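A Python illustration of the 4-way manual loop unrolling the excerpt refers to (on microcontrollers this would normally be C code, often emitted by the compiler): the loop-exit branch executes once per four multiply-adds instead of once per element.

def dot_unrolled4(a, b):
    n = len(a)
    acc0 = acc1 = acc2 = acc3 = 0.0
    i = 0
    while i + 4 <= n:            # one loop test per 4 multiply-adds
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4
    acc = acc0 + acc1 + acc2 + acc3
    while i < n:                 # scalar epilogue for the remainder
        acc += a[i] * b[i]
        i += 1
    return acc

print(dot_unrolled4([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))  # 15.0

The four independent accumulators also break the serial dependence chain, which is what lets hardware overlap memory accesses with arithmetic.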

Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in Public Cloud [article]

Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Mahmut Taylan Kandemir, Chita R. Das
2021 arXiv   pre-print
However, selecting the appropriate models dynamically at runtime to meet the desired accuracy with low latency at minimal deployment cost is a nontrivial problem.  ...  latency along with reduced deployment costs in a public cloud environment.  ...  Resource Types: We use both CPU and GPU instances depending on the request arrival load. GPU instances are cost-effective when packed with a large batch of requests for execution.  ... 
arXiv:2106.05345v1 fatcat:572nnljzy5f4bfqm5yftp3zn6y
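The CPU-versus-GPU observation in the excerpt is a fixed-cost/marginal-cost comparison: GPU instances amortize a higher price over large batches. A toy sketch with made-up prices, not Cocktail's policy:

def cheaper_instance(batch):
    cpu_cost = batch * 0.8          # roughly constant cost per request on CPU
    gpu_cost = 4.0 + batch * 0.1    # high fixed cost, cheap per extra request
    return "gpu" if gpu_cost < cpu_cost else "cpu"

print(cheaper_instance(2))   # cpu
print(cheaper_instance(32))  # gpu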

Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services

Rajiv Nishtala, Vinicius Petrucci, Paul Carpenter, Magnus Sjalander
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)  
Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres.  ...  Twig successfully leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres.  ...  Finally, principal component analysis [30] is performed to determine the most vital and distinct PMCs that capture the tail latency. This is similar to the methodology of Malik et al. [31].  ... 
doi:10.1109/hpca47549.2020.00023 dblp:conf/hpca/NishtalaPCS20 fatcat:v6qsmppcn5emtlizljctltszce
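A minimal sketch of the PCA step the excerpt describes, using scikit-learn on a matrix of performance-counter samples; the counter names and random data below are placeholders:

import numpy as np
from sklearn.decomposition import PCA

pmc_names = ["instructions", "llc_misses", "branch_misses", "stalls"]
samples = np.random.rand(200, len(pmc_names))  # rows: intervals, cols: PMCs

pca = PCA(n_components=2).fit(samples)
# Rank counters by their total weight in the top principal components.
weights = np.abs(pca.components_).sum(axis=0)
ranked = sorted(zip(pmc_names, weights), key=lambda t: -t[1])
print([name for name, _ in ranked])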

A Survey of Machine Learning Applied to Computer Architecture Design [article]

Drew D. Penney, Lizhong Chen
2019 arXiv   pre-print
This paper reviews machine learning applied system-wide to simulation and run-time optimization, and in many individual components, including memory systems, branch predictors, networks-on-chip, and GPUs  ...  Recent work, however, has explored broader applicability for design, optimization, and simulation.  ...  Runtime prediction used two sample configurations, one from CPU execution and one from GPU execution, to determine the optimal configuration. Lo et al.  ... 
arXiv:1909.12373v1 fatcat:o4nscgkjfbes7kqwmtjvvgl3oa
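The two-sample runtime prediction summarized in this excerpt is underspecified there; one plausible minimal reading (linear scaling from a single measured CPU run and a single measured GPU run — an assumption, not the surveyed method):

def pick_device(n, cpu_sample, gpu_sample):
    """Each sample is (problem_size, measured_seconds); n is the new size."""
    predict = lambda s: s[1] * n / s[0]   # linear scaling assumption
    return "cpu" if predict(cpu_sample) < predict(gpu_sample) else "gpu"

print(pick_device(10_000, cpu_sample=(1_000, 0.2), gpu_sample=(1_000, 0.5)))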

Distributed workflows with Jupyter

Iacopo Colonnelli, Marco Aldinucci, Barbara Cantalupo, Luca Padovani, Sergio Rabellino, Concetto Spampinato, Roberto Morelli, Rosario Di Carlo, Nicolò Magini, Carlo Cavazzoni
2021 Future generations computer systems  
As a byproduct, we extended the vanilla IPython kernel with workflow-based parallel and distributed execution capabilities.  ...  Jupyter Notebooks, with their capability to describe both imperative and declarative code in a unique format, allow taking the best of the two approaches, maintaining a clear separation between application  ...  the European Union's Horizon 2020 research and innovation programme under grant agreement No. 825111 [79], and the ACROSS project, "HPC Big Data Artificial Intelligence Cross Stack Platform Towards  ... 
doi:10.1016/j.future.2021.10.007 fatcat:2al5dpxqmrgeboqxgkgxbefxga

Learning Low-Precision Structured Subnetworks Using Joint Layerwise Channel Pruning and Uniform Quantization

Xinyu Zhang, Ian Colbert, Srinjoy Das
2022 Applied Sciences  
...  by hardware platforms such as GPUs and FPGAs.  ...  We demonstrate that our proposed method works effectively in combination with statistics-based quantization techniques to generate low precision structured subnetworks that can be efficiently accelerated  ...  While previous works have alluded to this optimization [26], we are not aware of any research that has explicitly exploited it as we do.  ... 
doi:10.3390/app12157829 fatcat:6ihyiujmordovf3stohpzue22q
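A compact sketch of the two ingredients this entry's title names — channel pruning by L1 norm and uniform quantization — in generic NumPy form, not the paper's joint layerwise training procedure:

import numpy as np

def prune_channels(w, keep_ratio=0.5):
    """w: (out_channels, ...) weight tensor; drop low-L1-norm channels."""
    norms = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
    k = max(1, int(len(norms) * keep_ratio))
    keep = np.sort(np.argsort(norms)[-k:])
    return w[keep]

def quantize_uniform(w, bits=8):
    """Map w onto 2**bits evenly spaced levels between its min and max."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((w - lo) / scale)
    return q * scale + lo          # de-quantized values

w = np.random.randn(8, 3, 3, 3)
print(quantize_uniform(prune_channels(w)).shape)  # (4, 3, 3, 3)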

AutonoML: Towards an Integrated Framework for Autonomous Machine Learning [article]

David Jacob Kedziora and Katarzyna Musial and Bogdan Gabrys
2022 arXiv   pre-print
Central to this drive is the appeal of engineering a computational system that both discovers and deploys high-performance solutions to arbitrary ML problems with minimal human interaction.  ...  Over the last decade, the long-running endeavour to automate high-level processes in machine learning (ML) has risen to mainstream prominence, stimulated by advances in optimisation techniques and their  ...  DNN candidates only when GPUs are free.  ... 
arXiv:2012.12600v2 fatcat:6rj4ubhcjncvddztjs7tql3itq

The Sky Above The Clouds [article]

Sarah Chasins, Alvin Cheung, Natacha Crooks, Ali Ghodsi, Ken Goldberg, Joseph E. Gonzalez, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph, Michael W. Mahoney, Aditya Parameswaran, David Patterson (+5 others)
2022 arXiv   pre-print
For example, telephony, the Internet, and PCs all started with a single provider, but in the United States each is now served by a competitive market that uses comprehensive and universal technology standards  ...  , David Tennenhouse, Marvin Theimer, Deepak Vij, and Matei Zaharia.  ...  We would like to thank our many colleagues who have generously read the early versions of this paper and provided insightful feedback: Dirk Bergemann, Adrian Cockcroft, David Culler, Ian Foster, Mark Russinovich  ... 
arXiv:2205.07147v1 fatcat:mpn7ivg4arghdbkyj42kii6x3a

Wi-Fi Meets ML: A Survey on Improving IEEE 802.11 Performance with Machine Learning [article]

Szymon Szott, Katarzyna Kosek-Szott, Piotr Gawłowicz, Jorge Torres Gómez, Boris Bellalta, Anatolij Zubow, Falko Dressler
2022 arXiv   pre-print
To this end, we analyze over 250 papers in the field, providing readers with an overview of the main trends.  ...  These technical innovations, including the plethora of configuration parameters, are making next-generation WLANs exceedingly complex as the dependencies between parameters and their joint optimization  ...  This approach allows Wi-Fi to quantify the effective available channel airtime of each Wi-Fi link (downlink/uplink) at runtime.  ... 
arXiv:2109.04786v3 fatcat:ny55qfhsnfduzcxyve5mylpr2m
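A toy model of the airtime quantification the excerpt mentions: treat available airtime as the share of a measurement window not already consumed by observed channel busy time or the link's own transmissions (a simplification; the surveyed approach is not detailed in the excerpt):

def available_airtime(window_us, busy_us, own_tx_us):
    used = min(window_us, busy_us + own_tx_us)
    return (window_us - used) / window_us

print(available_airtime(window_us=100_000, busy_us=40_000, own_tx_us=25_000))
# -> 0.35: 35% of the window is still available to this link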

Wi-Fi Meets ML: A Survey on Improving IEEE 802.11 Performance with Machine Learning

Szymon Szott, Katarzyna Kosek-Szott, Piotr Gawlowicz, Jorge Torres Gomez, Boris Bellalta, Anatolij Zubow, Falko Dressler
2022 IEEE Communications Surveys and Tutorials  
To this end, we analyze over 250 papers in the field, providing readers with an overview of the main trends.  ...  These technical innovations, including the plethora of configuration parameters, are making next-generation WLANs exceedingly complex as the dependencies between parameters and their joint optimization  ...  This approach allows Wi-Fi to quantify the effective available channel airtime of each Wi-Fi link (downlink/uplink) at runtime.  ... 
doi:10.1109/comst.2022.3179242 fatcat:sqmcwxuawrchjkaprnak4kawym
Showing results 1–15 of 55.