Using Pilot Systems to Execute Many Task Workloads on Supercomputers [article]

Andre Merzky and Matteo Turilli and Manuel Maldonado and Mark Santcroos and Shantenu Jha
2018 arXiv pre-print
High performance computing systems have historically been designed to support applications comprised of mostly monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job placeholders and late-binding. Pilot systems help to satisfy the resource requirements of workloads comprised of multiple tasks. RADICAL-Pilot (RP) is a modular and extensible Python-based pilot system. In this paper we describe RP's design, architecture and implementation, and characterize its performance. RP is capable of spawning more than 100 tasks/second and supports the steady-state execution of up to 16K concurrent tasks. RP can be used stand-alone, as well as integrated with other application-level tools as a runtime system.
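The placeholder and late-binding idea described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not RADICAL-Pilot's API: the names `Task`, `Pilot`, and `run` are hypothetical, time is counted in abstract steps, and the first-come-first-served policy is chosen only for simplicity. The pilot stands for a single batch job that already holds a fixed number of cores; tasks are bound to those cores only at the moment cores become free.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cores: int      # cores the task needs
    duration: int   # runtime in abstract time steps


@dataclass
class Pilot:
    """Hypothetical placeholder job holding `cores` on the machine."""
    cores: int
    free: int = field(init=False)

    def __post_init__(self):
        self.free = self.cores


def run(pilot, tasks):
    """Late binding: a task is assigned cores only when enough are free."""
    queue = deque(tasks)   # workload, specified independently of resources
    running = []           # (finish_time, task) pairs
    schedule = {}          # task name -> start time
    now = 0
    while queue or running:
        # Release the cores of every task that has finished by `now`.
        for finish, task in [r for r in running if r[0] <= now]:
            pilot.free += task.cores
            running.remove((finish, task))
        # Bind queued tasks in order while the head of the queue fits.
        while queue and queue[0].cores <= pilot.free:
            task = queue.popleft()
            pilot.free -= task.cores
            schedule[task.name] = now
            running.append((now + task.duration, task))
        if running:
            # Jump to the next completion time.
            now = min(finish for finish, _ in running)
        elif queue:
            raise ValueError(f"task {queue[0].name} exceeds the pilot size")
    return schedule, now


if __name__ == "__main__":
    workload = [Task("a", 2, 3), Task("b", 2, 1), Task("c", 2, 2), Task("d", 4, 1)]
    schedule, makespan = run(Pilot(cores=4), workload)
    print(schedule, makespan)
```

In this sketch all four tasks run inside one placeholder job, so only the pilot itself would wait in the batch queue; each task starts as soon as the pilot has room for it.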
arXiv:1512.08194v4 fatcat:wylszrloqfh35isa6weu2vxmmm