Improving the productivity of a high-performance computing (HPC) system has been a non-trivial challenge since the inception of petascale computing. As HPC systems head toward the exascale era, solutions to the productivity challenge must also satisfy an additional constraint on system-wide power consumption. The U.S. Department of Energy has mandated that a future exascale system operate under a strict power budget of 20 MW to 30 MW, both to support efficient electricity generation and distribution and to keep the operational cost of an exascale computing system manageable. These challenges have contributed to pushing the delivery of the first exascale system from 2018 to 2022. In the PASH project, we address the combined challenge of maximizing HPC productivity under a system-wide power constraint through power-aware resource management.
We proposed several static and dynamic power-aware resource management strategies to maximize system productivity under a system-wide power constraint. We define a job's productivity using a time-dependent value function that measures the importance of the application's output. Our resource management strategies combine the job value functions with the jobs' power-performance models to make informed scheduling decisions at runtime. We evaluated our strategies on a real HPC cluster at Lawrence Livermore National Laboratory. We also developed a simulation environment to evaluate our scheduling algorithms on hypothetical systems.
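To make the idea of value-based scheduling under a power cap concrete, the following is a minimal sketch and not the PASH implementation. It assumes a piecewise-linear value function (full value up to a soft deadline, decaying linearly to zero at a hard deadline) and a simple greedy heuristic that ranks jobs by value per watt. The names `Job`, `job_value`, and `select_jobs`, the shape of the value function, and the greedy rule are all illustrative assumptions. The predicted `runtime` and `power` fields stand in for the output of a power-performance model.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; all fields are illustrative assumptions."""
    name: str
    max_value: float      # value earned if the job finishes by soft_deadline
    soft_deadline: float  # time (s) after which the value starts to decay
    hard_deadline: float  # time (s) at or after which the value is zero
    runtime: float        # predicted runtime (s) from a power-performance model
    power: float          # predicted power draw (W) from the same model


def job_value(job: Job, completion_time: float) -> float:
    """Assumed time-dependent value: flat, then linear decay to zero."""
    if completion_time <= job.soft_deadline:
        return job.max_value
    if completion_time >= job.hard_deadline:
        return 0.0
    window = job.hard_deadline - job.soft_deadline
    return job.max_value * (job.hard_deadline - completion_time) / window


def select_jobs(queue: list[Job], now: float, power_budget: float) -> list[Job]:
    """Greedily start the jobs with the highest value per watt that fit the budget."""

    def density(job: Job) -> float:
        # Value the job would earn if started now, per watt it would consume.
        return job_value(job, now + job.runtime) / job.power

    chosen = []
    remaining = power_budget
    for job in sorted(queue, key=density, reverse=True):
        if job_value(job, now + job.runtime) <= 0.0:
            continue  # starting this job now would earn nothing
        if job.power <= remaining:
            chosen.append(job)
            remaining -= job.power
    return chosen


queue = [
    Job("A", max_value=10.0, soft_deadline=100.0, hard_deadline=200.0,
        runtime=50.0, power=400.0),
    Job("B", max_value=8.0, soft_deadline=100.0, hard_deadline=200.0,
        runtime=150.0, power=200.0),
    Job("C", max_value=5.0, soft_deadline=50.0, hard_deadline=60.0,
        runtime=100.0, power=100.0),
]
started = select_jobs(queue, now=0.0, power_budget=500.0)
print([job.name for job in started])
```

In this toy queue, job C would finish after its hard deadline and is skipped, and job B does not fit in the power left over once job A is started. A greedy value-per-watt rule is only one possible policy; it is used here because it is short, not because the source describes it.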
In the future, we plan to extend the capabilities of our simulation environment by modeling manufacturing variations in the simulated CPUs, and to design scheduling algorithms for workflow-based applications.