Malleability in resource allocation for improved system efficiency in high-performance computing

Lead Research Organisation: University of Edinburgh
Department Name: Edinburgh Parallel Computing Centre

Abstract

A significant part of the environmental impact and CO2 emissions of a high-performance computing (HPC) system can be attributed to its manufacture as well as its operation (including time spent idle). Once a system has been installed, it is therefore imperative that it is used as close to full capacity as possible and that science throughput is maximised at all times, in order to get the best return on both the monetary and carbon cost of the system. This highly desirable 100% utilisation rate is, however, near impossible to achieve in practice. The workload of a system is managed by its resource allocator, which attempts to place jobs from a submission queue (to which users continuously add new jobs) so as to fill gaps in the available resources. Perfect job placement is not always possible and, as a result, resources sit idle.

Malleability in resource allocation introduces the concept that the resources requested by a user at job submission time (the number of compute cores or nodes, or even the system itself) are not fixed and can be changed if doing so means a job can be scheduled to run, and thus complete, sooner.

MIRA ("Malleability In Resource Allocation for improved system efficiency in high-performance computing") will investigate the concept of malleability in compute resource allocation within a single system as well as across multiple systems, to improve overall system utilisation and science throughput, thereby maximising the "science per Joule" that can be achieved.
