FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing [chapter]

Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hannes Weisbach, Matthias Hille, Hermann Härtig, Alexander Margolin, Dror Sharf, Ely Levy, Pavel Gak (+7 others)
2020 Lecture Notes in Computational Science and Engineering  
The FFMK project designs, builds and evaluates a system-software architecture to address the challenges expected in Exascale systems. In particular, these challenges include performance losses caused by the much larger impact of runtime variability within applications, hardware, and the operating system (OS), as well as increased vulnerability to failures. The FFMK OS platform is built upon a multi-kernel architecture, which combines the L4Re microkernel and a virtualized Linux kernel into a noise-free, yet feature-rich execution environment. It further includes global, distributed platform management and system-level optimization services that transparently minimize checkpoint/restart overhead for applications. The project also researched algorithms to make collective operations fault tolerant in the presence of failing nodes. In this paper, we describe the basic components, algorithms, and services we developed in Phase 2 of the project.

Applications, Runtimes, Communication

HPC applications are highly specialized, but they achieve a certain level of platform independence by using common runtime and communication APIs such as the Message Passing Interface (MPI) [18]. However, just providing an MPI library and interconnect drivers (e.g., for InfiniBand) is not sufficient [60], because the majority of HPC codes use many Linux-specific APIs, too. The same is true for most HPC infrastructure, including parallel file systems and cluster management solutions. Compatibility with Linux is therefore essential, and it can only be achieved if applications are in fact compiled for Linux and started as Linux processes.

Dynamic Platform Management

The FFMK OS platform is more than a multi-kernel architecture. As motivated in the Phase 1 report [59], we include distributed global management, because the system software is best suited to monitor the health and load of nodes. This is in contrast to the current way of operating HPC clusters and supercomputers, where load balancing problems and fault tolerance are tasks that practically every application deals with on its own. In the presence of frequent component failures, hardware heterogeneity, and dynamic resource demands, applications can no longer assume that compute resources are assigned statically.

Load Balancing

We aim to shift more coordination and decision making into the system layer.
In the FFMK OS, the necessary monitoring and decision making is done at three levels: (1) on each node, (2) per application instance across multiple nodes, and (3) based on a global view by redundant master management nodes. We published fault-tolerant gossip algorithms [3] suitable for inter-node information dissemination and found that they have negligible performance overhead [36]. We further achieved promising results with regard to oversubscription of cores, which can improve throughput for some applications [62]. We have since integrated the gossip algorithm, a per-node monitoring daemon, and a distributed decision-making algorithm aimed at automatic, process-level load balancing for oversubscribed nodes. However, one key component of this platform management service is still missing: the ability to migrate processes from overloaded nodes to ones that have spare CPU cycles. Transparent migration of MPI processes that directly access InfiniBand hardware has proven to be extremely difficult. We leave this aspect for a future publication, but we do summarize our key results on novel diffusion-based load balancing algorithms [39] in this report. These algorithms could be integrated into the FFMK load management service once process-level migration is possible.

Fault Tolerance

The ability to migrate processes away from failing nodes can also be used for proactive fault tolerance. However, the focus of our research on system-level fault tolerance has been in two other areas. First, we published on efficient collective operations in the presence of failures [27, 31, 43]. Second, we continued research on scalable checkpointing, where we concentrated on global coordination
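To illustrate the kind of gossip-based dissemination mentioned above, the following is a minimal sketch of a push-style gossip round; it is not the published FFMK algorithm [3], and the function name, fanout parameter, and heartbeat representation are assumptions chosen for illustration. Each node pushes its view (a map from node ID to the freshest heartbeat it has seen) to a few random peers, which merge by keeping the newer value per node:

```python
import random

def gossip_round(views, fanout=2):
    """One synchronous gossip round (illustrative sketch): every node
    pushes its view (node -> latest heartbeat seen) to `fanout` random
    peers; receivers keep the freshest heartbeat per node."""
    updates = [dict(v) for v in views]  # next-round state, copied
    for sender, view in enumerate(views):
        peers = random.sample(
            [n for n in range(len(views)) if n != sender], fanout)
        for p in peers:
            for node, hb in view.items():
                if hb > updates[p].get(node, -1):
                    updates[p][node] = hb
    return updates

# Demo: fresh information at node 0 spreads through an 8-node system.
views = [{n: 0 for n in range(8)} for _ in range(8)]
views[0][0] = 5  # node 0 has a fresh local heartbeat
for _ in range(4):
    views = gossip_round(views)
informed = sum(1 for v in views if v[0] == 5)
```

With fanout 2, the number of informed nodes grows roughly exponentially per round, which is why such schemes scale well and tolerate individual node failures: no single node is a bottleneck for dissemination.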
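The diffusion-based load balancing summarized above can be sketched with a textbook first-order diffusion scheme; this is not the algorithm of [39] itself, and the function name, topology, and diffusion coefficient `alpha` are assumptions for illustration. Each node repeatedly exchanges a fixed fraction of its load difference with every neighbor, so load flows from overloaded nodes toward underloaded ones using only local information:

```python
def diffusion_step(load, neighbors, alpha=0.25):
    """One first-order diffusion step (illustrative sketch): node i
    gains alpha * (load[j] - load[i]) from each neighbor j, so load
    flows down the gradient while the total stays constant."""
    new = list(load)
    for i, load_i in enumerate(load):
        for j in neighbors[i]:
            new[i] += alpha * (load[j] - load_i)
    return new

# Demo: a 4-node ring with all load initially on node 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
load = [16.0, 0.0, 0.0, 0.0]
for _ in range(50):
    load = diffusion_step(load, ring)
```

Because every exchange is symmetric, the total load is conserved, and on a connected topology the iteration converges to the uniform distribution; the convergence rate depends on `alpha` and the network's spectral properties.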
doi:10.1007/978-3-030-47956-5_16