Constraint Programming-based Job Dispatching for Modern HPC Applications

Cristian Alejandro Galleguillos Miccono
2021
A High-Performance Computing (HPC) job dispatcher is a critical piece of software that assigns the finite computing resources of a system to the jobs submitted by users, who request some of those resources to execute their software. This assignment over time is known as the on-line job dispatching problem in HPC systems. Because the problem is on-line, the solutions generated by a job dispatcher must be computed in real time, and the time required to generate them cannot exceed some threshold, so as not to affect normal system functioning. In addition, the job dispatcher must deal with considerable uncertainty: unknown submission times, an unknown quantity of requested resources, and unknown (actual) duration of jobs. Heuristic techniques have been broadly used in HPC systems, since they yield solutions in a short time, at the cost of (sub-)optimality. These heuristics are composed of two separate elements, the scheduling part and the resource allocation part, and thus generate a decoupled decision, which is the major culprit of performance loss. Optimization techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time. Nowadays, HPC systems are being used for modern applications, such as big data analytics and predictive model building, that generally employ many short jobs. HPC users tend to overestimate the duration of their jobs, which makes it challenging to identify short jobs at dispatching time. However, prediction methods may be useful to improve the accuracy of the expected duration and to classify jobs correctly. Therefore, HPC job dispatchers need to process large numbers of short jobs quickly and make decisions on-line, while ensuring high Quality-of-Service (QoS) levels and meeting demanding response times for generating dispatching decisions. Constraint Programming (CP) has been shown to be an effective approach to tackle job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to meet the challenges of on-line dispatching, such as generating dispatching decisions in a brief period and integrating current and past information of the housing system. Both limitations jeopardize achieving high QoS levels and thus impede the adoption of CP-based dispatchers in HPC systems.
For these reasons, the purpose of this work is to propose a class of CP-based dispatchers that are more suitable for HPC systems running modern applications. To identify the class of jobs, we propose a method to predict job durations, which allows dispatchers to incorporate more on-line information. The job dispatchers we propose significantly reduce the time required to generate on-line dispatching decisions, and make effective use of job duration predictions to decrease waiting times and job slowdowns, especially for workloads dominated by short jobs.

List of Tables: 2.1 Frequency and average duration of all jobs and the three classes CPU-based, MIC-based and GPU-based in the Eurora workload.
doi:10.6092/unibo/amsdottorato/9497 fatcat:rwq6ihwqe5hnhkir7wefgpv6w4