TracSim: Simulating and scheduling trapped power capacity to maximize machine room throughput

Ziming Zhang, Michael Lang, Scott Pakin, Song Fu
2016 Parallel Computing  
The power supplied to machine rooms tends to be over-provisioned because it is specified in practice not by workload demands but rather by high-energy LINPACK runs or nameplate power estimates. This results in a considerable amount of trapped power capacity, i.e., excess power infrastructure. Instead of being wasted, this trapped power capacity should be reclaimed to accommodate more compute nodes in the machine room and thereby increase system throughput. To do this, however, we need the ability to enforce a system-wide power cap. In this paper, we present TracSim, a full-system simulator that enables users to measure trapped power capacity and evaluate the performance of different policies for scheduling parallel tasks under a power cap. TracSim simulates the execution environment of a production HPC cluster at Los Alamos National Laboratory (LANL). It lets users specify the system topology, hardware configuration, power cap, and task workload, and develop resource configuration and task scheduling policies that aim to maximize machine-room throughput while using CPU throttling to keep power consumption under the cap. We use real measurements from the LANL cluster to set TracSim's configuration parameters. We use TracSim to implement and evaluate four resource scheduling policies. Simulation results indicate the performance of those policies and quantify the amount of trapped capacity that can effectively be reclaimed.
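
The notion of trapped capacity described in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not TracSim's implementation: the function names and all wattage figures are hypothetical, and it assumes each added node can be held to a fixed per-node cap through CPU throttling.

```python
def trapped_capacity_w(provisioned_w: float, observed_peak_w: float) -> float:
    """Power capacity that is provisioned but never drawn by the real workload."""
    return max(0.0, provisioned_w - observed_peak_w)


def extra_nodes(trapped_w: float, per_node_cap_w: float) -> int:
    """Whole nodes that fit in the trapped capacity if each is capped at per_node_cap_w."""
    if per_node_cap_w <= 0:
        raise ValueError("per_node_cap_w must be positive")
    return int(trapped_w // per_node_cap_w)


# Hypothetical numbers, chosen only for illustration.
provisioned = 100_000.0   # W, room budget derived from nameplate or LINPACK power
observed_peak = 70_000.0  # W, highest draw seen under the production workload
node_cap = 250.0          # W, assumed per-node cap enforced by CPU throttling

trapped = trapped_capacity_w(provisioned, observed_peak)
print(trapped, extra_nodes(trapped, node_cap))
```

A lower per-node cap admits more nodes but slows each one, which is the trade-off a power-capped scheduling policy has to weigh.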
doi:10.1016/j.parco.2015.11.002 fatcat:ibbpyoxkinfdnmu7x2lavaqom4