QoS-Aware Scheduling of Heterogeneous Servers for Inference in Deep Neural Networks

Zhou Fang, Tong Yu, Ole J. Mengshoel, Rajesh K. Gupta
2017 Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM '17  
Deep neural networks (DNNs) are popular in diverse fields such as computer vision and natural language processing. DNN inference tasks are emerging as a service provided by cloud computing environments. However, cloud-hosted DNN inference faces new challenges in scheduling workloads for the best Quality of Service (QoS), because QoS depends on batch size, model complexity, and resource allocation. This paper represents the QoS metric as a utility function of response delay and inference accuracy.
We first propose a simple and effective heuristic approach that keeps response delay low and satisfies the requirement on processing throughput. Then we describe an advanced deep reinforcement learning (RL) approach that learns to schedule from experience. The RL scheduler is trained to maximize QoS, using a set of system statuses as the input to the RL policy model. Our approach performs scheduling actions only when there are free GPUs, thereby reducing scheduling overhead compared with common RL schedulers that run at every continuous time step. We evaluate the schedulers on a simulation platform and demonstrate the advantages of RL over heuristics.
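The two ideas in the abstract, a QoS utility over response delay and accuracy, and scheduling only when a GPU is free, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear-discount form of `utility`, the `deadline` parameter, the `Model` fields, and the greedy `pick_model` rule, which stands in for both the paper's heuristic and its RL policy.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    accuracy: float       # assumed inference accuracy in [0, 1]
    batch_latency: float  # assumed seconds needed to process one batch


def utility(delay: float, accuracy: float, deadline: float = 1.0) -> float:
    """Hypothetical QoS utility: accuracy discounted linearly by delay.

    The paper defines QoS as a utility function of response delay and
    accuracy; this particular form is an illustrative assumption.
    """
    return accuracy * max(0.0, 1.0 - delay / deadline)


def pick_model(models, queue_wait: float, deadline: float = 1.0) -> Model:
    """Greedy stand-in policy: choose the model with the highest predicted utility."""
    return max(
        models,
        key=lambda m: utility(queue_wait + m.batch_latency, m.accuracy, deadline),
    )


def schedule(free_gpus, models, queue_wait: float) -> dict:
    """Take scheduling actions only for GPUs that are currently free.

    With no free GPU, no decision is made, which mirrors the
    event-driven scheduling described in the abstract.
    """
    return {gpu: pick_model(models, queue_wait).name for gpu in free_gpus}
```

With two hypothetical models, `Model("small", 0.70, 0.1)` and `Model("large", 0.80, 0.2)`, this sketch prefers the more accurate model when the queue is empty and falls back to the faster one as waiting time grows, which is the delay/accuracy trade-off the utility is meant to capture.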
doi:10.1145/3132847.3133045 dblp:conf/cikm/FangYMG17 fatcat:5ndhfmekovak5cjdzebsa4evtu