Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning
[article]
2021
arXiv
pre-print
Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. ...
In today's clusters containing thousands of GPU servers, running a single scheduler to manage all arrival jobs in a timely and effective manner is challenging, due to the large workload scale. ...
4 GPUs and the other (8 cores) with 2 GPUs; (iii) heterogeneous server configurations, where for the partition managed by each scheduler, 20% of the servers each have 2 GPUs and 1 CPU (8 cores), 40% adopt ...
arXiv:2112.13354v1
fatcat:csthoe3fuffurm3c3supvznsta
Model-driven Cluster Resource Management for AI Workloads in Edge Clouds
[article]
2022
arXiv
pre-print
Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. ...
After validating our models using extensive experiments, we use them to design various cluster resource management algorithms to intelligently manage multiple applications on edge accelerators while respecting ...
ρ = λ/(cμ) ...
MODEL-DRIVEN CLUSTER RESOURCE MANAGEMENT In this section, we show how the predictive capabilities of our analytic models can be employed for cluster resource management tasks such as ...
arXiv:2201.07312v1
fatcat:d4wdw7frbvcfvjwgfwvcod5ufa
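The snippet above quotes the classic multi-server utilization formula ρ = λ/(cμ). As a minimal sketch of how such an analytic model can drive resource management decisions (assuming the standard M/M/c queueing interpretation, which this notation conventionally denotes; the paper's actual models may differ), the utilization and the Erlang-C waiting probability can be computed as:

```python
from math import factorial

def utilization(lam, mu, c):
    """Per-server utilization rho = lambda / (c * mu) for c identical servers."""
    return lam / (c * mu)

def erlang_c(lam, mu, c):
    """Probability an arriving job must wait in an M/M/c queue (requires rho < 1)."""
    a = lam / mu                      # offered load in Erlangs
    rho = a / c
    assert rho < 1, "queue is unstable when rho >= 1"
    waiting = (a ** c / factorial(c)) / (1 - rho)
    served = sum(a ** k / factorial(k) for k in range(c))
    return waiting / (served + waiting)
```

For example, with λ = 2 jobs/s, μ = 1 job/s per server, and c = 4 servers, ρ = 0.5 and roughly 17% of arriving jobs would queue; a resource manager can use such predictions to decide whether co-locating another application would violate a latency target.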
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
[article]
2022
arXiv
pre-print
An efficient scheduler design for such GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. ...
However, traditional approaches designed for big data or high performance computing workloads cannot support DL workloads to fully utilize the GPU resources. ...
We discuss prior works based on whether they adopt heterogeneous resources, GPU sharing and elastic training.
Heterogeneous Resources. ...
arXiv:2205.11913v3
fatcat:fnbinueyijb4nc75fpzd6hzjgq
A systems perspective on GPU computing
2016
Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit - GPGPU '16
To this end, his contributions include novel scheduling and resource management abstractions, runtime specialization, and novel data management techniques to support scalable, distributed GPU frameworks ...
His vision encompassed the conceptualization, implementation, and demonstration of systems abstractions and runtime methods to elevate GPUs into first-class citizens in today's and future heterogeneous ...
Acknowledgments We would like to thank Professor Sudhakar Yalamanchili, Ada Gavrilovska, Vishakha Gupta, Sudarsun Kannan, Alexander Merritt, and Dipanjan Sengupta for their feedback and assistance with ...
doi:10.1145/2884045.2884057
dblp:conf/ppopp/Farooqui16
fatcat:lcxhf6nfsvannnbp5lusxudmmu
Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge
[article]
2020
arXiv
pre-print
However, efficiently utilizing the edge resources for the model update is a hard problem due to the heterogeneity among the edge devices and the resource interference caused by the co-location of the DL ...
To overcome these challenges, we present Deep-Edge, a load- and interference-aware, fault-tolerant resource management framework for performing model update at the edge that uses distributed training. ...
However, these approaches are not applicable for edge clusters as none of them considers resource interference while allocating heterogeneous resources for the DL model update task. VII. ...
arXiv:2004.05740v1
fatcat:mcc7gcdjkzef5d7cd3jbuht444
2021 Index IEEE Transactions on Parallel and Distributed Systems Vol. 32
2022
IEEE Transactions on Parallel and Distributed Systems
., +, TPDS Aug. 2021 2086-2100
Learning-Driven Interference-Aware Workload Parallelization for Streaming Applications in Heterogeneous Cluster. ., +, TPDS July 2021 1578-1590
Graph coloring: Feluca: A Two-Stage Graph Coloring Algorithm With Color-Centric Paradigm on GPU. Zheng, Z., +, ...
doi:10.1109/tpds.2021.3107121
fatcat:e7bh2xssazdrjcpgn64mqh4hb4
DRMaestro: orchestrating disaggregated resources on virtualized data-centers
2021
Journal of Cloud Computing: Advances, Systems and Applications
After that, we evaluate DRMaestro via a real prototype on Kubernetes and a trace-driven simulation. ...
The results show that for some applications the impact is minimal, but other ones can suffer up to 80% slowdown in the data transfer part. ...
Chih-Chieh Yang performed the analysis of the possible performance interference introduced by additional network load in disaggregated resources. ...
doi:10.1186/s13677-021-00238-6
fatcat:njwlthulevep7m2jqlbbd4n4de
ADAPT: An Event-Based Adaptive Collective Communication Framework
2018
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
In particular, we demonstrate at least 1.3× and 1.5× speedup for CPU data and 2× and 10× speedup for GPU data using ADAPT event-based broadcast and reduce operations. ...
We evaluate our framework with two popular collective operations: broadcast and reduce on both CPU and GPU clusters. ...
[6] present a hardware multicast-based broadcast which benefits from IB hardware multicast. However, none of them encompasses network hierarchical topology for heterogeneous GPU-based clusters. ...
doi:10.1145/3208040.3208054
dblp:conf/hpdc/LuoWBPWD18
fatcat:cfebghog25cktm6qj2gzjkllru
Construction of Artistic Design Patterns Based on Improved Distributed Data Parallel Computing of Heterogeneous Tasks
2022
Mathematical Problems in Engineering
Each node relies on ZooKeeper to form a cluster, which realizes distributed functions such as centralized management of distributed resources, failover, and resumed transmission. ...
Based on the analysis and study of the multicore CPU and GPU architectures in the desktop system, as well as the original CPU-GPU heterogeneous parallel technology, this article optimizes the solution ...
Under the von Neumann architecture, the operation of programs is instruction based and instruction flow driven. ...
doi:10.1155/2022/3890255
fatcat:jnybktiryjhxhbvhwlul2nk7ly
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
[article]
2021
arXiv
pre-print
Pollux promotes fairness among DL jobs competing for resources based on a more meaningful measure of useful job progress, and reveals a new opportunity for reducing DL cost in cloud environments. ...
resource and training configurations for every job. ...
Acknowledgements We thank our shepherd, Michael Isard, and the anonymous OSDI reviewers for their insightful comments and suggestions that improved our work. ...
arXiv:2008.12260v2
fatcat:wupzzej7crf4bek53a6scbqtli
Managing GPU Concurrency in Heterogeneous Architectures
2014
2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
GPU-based concurrency management techniques when employed in heterogeneous systems. ...
The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications ...
ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable feedback. ...
doi:10.1109/micro.2014.62
dblp:conf/micro/KayiranNJAKLMD14
fatcat:v5xeff76hjeibkfjdp6se4hhta
Techniques for Shared Resource Management in Systems with Throughput Processors
[article]
2018
arXiv
pre-print
Our evaluations show that the GPU-aware cache and memory management techniques proposed in this dissertation are effective at mitigating the interference caused by GPUs on current and future GPU-based ...
We propose changes to the memory controller design and its scheduling policy to mitigate inter-application interference in heterogeneous CPU-GPU systems. ...
Acknowledgements First and foremost, I would like to thank my parents, Khrieng and Ruchanee Ausavarungnirun for their endless encouragement, love, and support. ...
arXiv:1803.06958v1
fatcat:3mqbwegpkvdrpk6sqwb3ooyh7e
A survey of cloud resource management for complex engineering applications
2016
Frontiers of Computer Science
As a new type of Cloud applications, CEA also brings the challenges of dealing with Cloud resources. In this paper, we provide a comprehensive survey of Cloud resource management research for CEAs. ...
Traditionally, Complex Engineering Applications (CEAs), which consist of numerous components (software) and require a large amount of computing resources, usually run in dedicated clusters or high performance ...
Acknowledgements We thank the anonymous reviewers for their insightful comments and suggestions. This work was supported by National Science Foundation of China under ...
doi:10.1007/s11704-015-4207-x
fatcat:iq5odhb6o5djvcdhcnx6cxohzi
ValuePack: Value-based scheduling framework for CPU-GPU clusters
2012
2012 International Conference for High Performance Computing, Networking, Storage and Analysis
With such heterogeneous environments becoming common, it is important to revisit scheduling problems for clusters and cloud environments. ...
In this paper, we formulate and address the problem of value-driven scheduling of independent jobs on heterogeneous clusters, which captures both the urgency and relative priority of jobs. ...
Torque [10] , an open-source resource manager, is being widely used in hundreds of supercomputer centers to manage heterogeneous clusters comprising multicore CPUs and GPUs. ...
doi:10.1109/sc.2012.111
dblp:conf/sc/RaviBAC12
fatcat:hy5rwe5wfna3hb4bmndf34vhse
Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
[article]
2021
arXiv
pre-print
Second, we introduce a general-purpose framework, which manages resources based on historical data. ...
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. ...
ACKNOWLEDGMENTS We thank the anonymous reviewers for their valuable comments. ...
arXiv:2109.01313v1
fatcat:izw77evef5fpzb2ent3u6adyca