Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning
[article]
2021
arXiv
pre-print
Efficient scheduling of distributed deep learning (DL) jobs in large GPU clusters is crucial for resource efficiency and job performance. While server sharing among jobs improves resource utilization, interference among co-located DL jobs occurs due to resource contention. Interference-aware job placement has been studied, with white-box approaches based on explicit interference modeling and black-box schedulers with reinforcement learning. In today's clusters containing thousands of GPU
arXiv:2112.13354v1
fatcat:csthoe3fuffurm3c3supvznsta