
Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning [article]

Xiaoyang Zhao, Chuan Wu
2021 arXiv   pre-print
Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance.  ...  In today's clusters containing thousands of GPU servers, running a single scheduler to manage all arrival jobs in a timely and effective manner is challenging, due to the large workload scale.  ...  4 GPUs and the other (8 cores) with 2 GPUs; (iii) heterogeneous server configurations, where for the partition managed by each scheduler, 20% of the servers each have 2 GPUs and 1 CPU (8 cores), 40% adopt  ... 
arXiv:2112.13354v1 fatcat:csthoe3fuffurm3c3supvznsta

Model-driven Cluster Resource Management for AI Workloads in Edge Clouds [article]

Qianlin Liang, Walid A. Hanafy, Ahmed Ali-Eldin, Prashant Shenoy
2022 arXiv   pre-print
Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads.  ...  After validating our models using extensive experiments, we use them to design various cluster resource management algorithms to intelligently manage multiple applications on edge accelerators while respecting  ...  𝜌 = 𝜆/(𝑐𝜇)  MODEL-DRIVEN CLUSTER RESOURCE MANAGEMENT: In this section, we show how the predictive capabilities of our analytic models can be employed for cluster resource management tasks such as  ... 
arXiv:2201.07312v1 fatcat:d4wdw7frbvcfvjwgfwvcod5ufa

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision [article]

Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
2022 arXiv   pre-print
An efficient scheduler design for such GPU datacenter is crucially important to reduce the operational cost and improve resource utilization.  ...  However, traditional approaches designed for big data or high performance computing workloads can not support DL workloads to fully utilize the GPU resources.  ...  We discuss prior works based on whether they adopt heterogeneous resources, GPU sharing and elastic training. Heterogeneous Resources.  ... 
arXiv:2205.11913v3 fatcat:fnbinueyijb4nc75fpzd6hzjgq

A systems perspective on GPU computing

Naila Farooqui
2016 Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit - GPGPU '16  
To this end, his contributions include novel scheduling and resource management abstractions, runtime specialization, and novel data management techniques to support scalable, distributed GPU frameworks  ...  His vision encompassed the conceptualization, implementation, and demonstration of systems abstractions and runtime methods to elevate GPUs into first-class citizens in today's and future heterogeneous  ...  Acknowledgments We would like to thank Professor Sudhakar Yalamanchili, Ada Gavrilovska, Vishakha Gupta, Sudarsun Kannan, Alexander Merritt, and Dipanjan Sengupta for their feedback and assistance with  ... 
doi:10.1145/2884045.2884057 dblp:conf/ppopp/Farooqui16 fatcat:lcxhf6nfsvannnbp5lusxudmmu

Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge [article]

Anirban Bhattacharjee, Ajay Dev Chhokra, Hongyang Sun, Shashank Shekhar, Aniruddha Gokhale, Gabor Karsai, Abhishek Dubey
2020 arXiv   pre-print
However, efficiently utilizing the edge resources for the model update is a hard problem due to the heterogeneity among the edge devices and the resource interference caused by the co-location of the DL  ...  To overcome these challenges, we present Deep-Edge, a load- and interference-aware, fault-tolerant resource management framework for performing model update at the edge that uses distributed training.  ...  However, these approaches are not applicable for edge clusters as none of them considers resource interference while allocating heterogeneous resources for the DL model update task. VII.  ... 
arXiv:2004.05740v1 fatcat:mcc7gcdjkzef5d7cd3jbuht444

2021 Index IEEE Transactions on Parallel and Distributed Systems Vol. 32

2022 IEEE Transactions on Parallel and Distributed Systems  
., +, TPDS Aug. 2021 2086-2100 Learning-Driven Interference-Aware Workload Parallelization for Streaming Applications in Heterogeneous Cluster.  ...  ., +, TPDS July 2021 1578-1590 Learning-Driven Interference-Aware Workload Parallelization for Streaming Applications in Heterogeneous Cluster.  ...  Graph coloring Feluca: A Two-Stage Graph Coloring Algorithm With Color-Centric Paradigm on GPU. Zheng, Z., +,  ... 
doi:10.1109/tpds.2021.3107121 fatcat:e7bh2xssazdrjcpgn64mqh4hb4

DRMaestro: orchestrating disaggregated resources on virtualized data-centers

Marcelo Amaral, Jordà Polo, David Carrera, Nelson Gonzalez, Chih-Chieh Yang, Alessandro Morari, Bruce D'Amora, Alaa Youssef, Malgorzata Steinder
2021 Journal of Cloud Computing: Advances, Systems and Applications  
After that, we evaluate DRMaestro via a real prototype on Kubernetes and a trace-driven simulation.  ...  The results show that for some applications the impact is minimal, but other ones can suffer up to 80% slowdown in the data transfer part.  ...  Chih-Chieh Yang performed the analysis of the possible performance interference introduced by additional network load in disaggregated resources.  ... 
doi:10.1186/s13677-021-00238-6 fatcat:njwlthulevep7m2jqlbbd4n4de


ADAPT: An Event-based Adaptive Collective Communication Framework

Xi Luo, Wei Wu, George Bosilca, Thananon Patinyasakdikul, Linnan Wang, Jack Dongarra
2018 Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18  
In particular, we demonstrate at least 1.3× and 1.5× speedup for CPU data and 2× and 10× speedup for GPU data using ADAPT event-based broadcast and reduce operations.  ...  We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters.  ...  [6] present a hardware multicast-based broadcast which benefits from IB hardware multicast. However, none of them encompasses network hierarchical topology for heterogeneous GPU-based clusters.  ... 
doi:10.1145/3208040.3208054 dblp:conf/hpdc/LuoWBPWD18 fatcat:cfebghog25cktm6qj2gzjkllru

Construction of Artistic Design Patterns Based on Improved Distributed Data Parallel Computing of Heterogeneous Tasks

Yao Sun, Gengxin Sun
2022 Mathematical Problems in Engineering  
Each node relies on ZooKeeper to form a cluster, which realizes distributed functions such as centralized management of distributed resources, failover, and resumed transmission.  ...  Based on the analysis and study of the multicore CPU and GPU architectures in the desktop system, as well as the original CPU-GPU heterogeneous parallel technology, this article optimizes the solution  ...  Under the von Neumann architecture, the operation of programs is instruction based and instruction flow driven.  ... 
doi:10.1155/2022/3890255 fatcat:jnybktiryjhxhbvhwlul2nk7ly

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [article]

Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing
2021 arXiv   pre-print
Pollux promotes fairness among DL jobs competing for resources based on a more meaningful measure of useful job progress, and reveals a new opportunity for reducing DL cost in cloud environments.  ...  resource and training configurations for every job.  ...  Acknowledgements We thank our shepherd, Michael Isard, and the anonymous OSDI reviewers for their insightful comments and suggestions that improved our work.  ... 
arXiv:2008.12260v2 fatcat:wupzzej7crf4bek53a6scbqtli

Managing GPU Concurrency in Heterogeneous Architectures

Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das
2014 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture  
GPU-based concurrency management techniques when employed in heterogeneous systems.  ...  The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable feedback.  ... 
doi:10.1109/micro.2014.62 dblp:conf/micro/KayiranNJAKLMD14 fatcat:v5xeff76hjeibkfjdp6se4hhta

Techniques for Shared Resource Management in Systems with Throughput Processors [article]

Rachata Ausavarungnirun
2018 arXiv   pre-print
Our evaluations show that the GPU-aware cache and memory management techniques proposed in this dissertation are effective at mitigating the interference caused by GPUs on current and future GPU-based  ...  We propose changes to the memory controller design and its scheduling policy to mitigate inter-application interference in heterogeneous CPU-GPU systems.  ...  Acknowledgements First and foremost, I would like to thank my parents, Khrieng and Ruchanee Ausavarungnirun for their endless encouragement, love, and support.  ... 
arXiv:1803.06958v1 fatcat:3mqbwegpkvdrpk6sqwb3ooyh7e

A survey of cloud resource management for complex engineering applications

Haibao Chen, Song Wu, Hai Jin, Wenguang Chen, Jidong Zhai, Yingwei Luo, Xiaolin Wang
2016 Frontiers of Computer Science  
As a new type of Cloud applications, CEA also brings the challenges of dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs.  ...  Traditionally, Complex Engineering Applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usually run in dedicated clusters or high performance  ...  Acknowledgements We thank the anonymous reviewers for their insightful comments and suggestions. This work was supported by National Science Foundation of China under  ... 
doi:10.1007/s11704-015-4207-x fatcat:iq5odhb6o5djvcdhcnx6cxohzi

ValuePack: Value-based scheduling framework for CPU-GPU clusters

Vignesh T. Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
With such heterogeneous environments becoming common, it is important to revisit scheduling problems for clusters and cloud environments.  ...  In this paper, we formulate and address the problem of value-driven scheduling of independent jobs on heterogeneous clusters, which captures both the urgency and relative priority of jobs.  ...  Torque [10] , an open-source resource manager, is being widely used in hundreds of supercomputer centers to manage heterogeneous clusters comprising multicore CPUs and GPUs.  ... 
doi:10.1109/sc.2012.111 dblp:conf/sc/RaviBAC12 fatcat:hy5rwe5wfna3hb4bmndf34vhse

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters [article]

Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang
2021 arXiv   pre-print
Second, we introduce a general-purpose framework, which manages resources based on historical data.  ...  Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry.  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable comments.  ... 
arXiv:2109.01313v1 fatcat:izw77evef5fpzb2ent3u6adyca