Dynamic resource management for HPC environments

SNIC 2018/5-45


SNAC Small

Principal Investigator:

Abel Souza


Umeå universitet

Start Date:


End Date:


Primary Classification:

10201: Computer Sciences




High Performance Computing (HPC) clusters have been unable to properly manage a new class of highly dynamic, adaptable, large and data-intensive workloads. Next-generation data-intensive scientific workflows need to support streaming and real-time applications with dynamic resource needs on high performance computing platforms. The static resource allocation model currently used by most HPC systems was designed for monolithic MPI applications and is insufficient to support the elastic resource needs of current and future workflows. This model makes such workflows very challenging to scale, makes it even harder to introduce the new features and capabilities that dynamic workloads need, and leaves allotted resources poorly utilized. In this project, we address the design, implementation and evaluation of an elastic framework for managing resources for scientific workflows on current HPC systems. The framework will treat a workflow's resource slot as an adaptable window that may map to different physical resources over the duration of the workflow. It may use collocation, live-migration and checkpoint-restart as the underlying mechanisms to place the workflow execution across this dynamic window of resources. The framework could provide the foundation needed to enable dynamic allocation of the HPC resources that streaming and real-time workflows require, and will be easily adaptable and extensible to current and new types of distributed workloads, allowing HPC centres to simplify operations and support new users.
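The idea of a resource slot as an adaptable window can be illustrated with a minimal sketch. It assumes, purely for illustration, that the window is recorded as a time-ordered series of remappings to sets of node names; `ElasticWindow`, `remap`, `nodes_at`, `released` and the node names are hypothetical and are not part of the proposed framework. The nodes returned by `released` are those whose workflow state would have to be relocated by a mechanism such as live-migration or checkpoint-restart when the window changes.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ElasticWindow:
    """Hypothetical model: a workflow's resource slot whose physical mapping changes over time."""
    workflow: str
    _times: list = field(default_factory=list)   # times at which the window was remapped
    _nodes: list = field(default_factory=list)   # node set valid from the matching time onward

    def remap(self, t: int, nodes: set) -> None:
        """From time t onward, the window maps to `nodes`. Remaps must be recorded in time order."""
        if self._times and t < self._times[-1]:
            raise ValueError("remaps must be recorded in time order")
        self._times.append(t)
        self._nodes.append(frozenset(nodes))

    def nodes_at(self, t: int) -> frozenset:
        """Physical nodes backing the window at time t (empty before the first placement)."""
        i = bisect.bisect_right(self._times, t) - 1
        return self._nodes[i] if i >= 0 else frozenset()

    def released(self, t_prev: int, t_next: int) -> frozenset:
        """Nodes used at t_prev but not at t_next, i.e. where state would need relocating."""
        return self.nodes_at(t_prev) - self.nodes_at(t_next)


# Illustrative usage with made-up node names: place, grow, then shrink the window.
w = ElasticWindow("stream-analysis")
w.remap(0, {"n1", "n2"})
w.remap(10, {"n1", "n2", "n3", "n4"})
w.remap(25, {"n3", "n4"})

print(sorted(w.nodes_at(5)))        # ['n1', 'n2']
print(sorted(w.released(10, 25)))   # ['n1', 'n2']
```

Keeping the full mapping history, rather than only the current node set, lets a scheduler reason about where a workflow ran at any point in its lifetime, which is what distinguishes the adaptable window from a static allocation.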