A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters

Christoph Riesinger, Arash Bakhtiari, Martin Schreiber, Philipp Neumann, Hans-Joachim Bungartz
Computation, 2017
Heterogeneous clusters are a widely used class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, and they provide huge computational potential. Programming them in a scalable way that exploits their maximal performance introduces numerous challenges such as optimizations for the different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and the hiding of communication with computation.
We utilize the lattice Boltzmann method (LBM) for fluid flow as a representative scientific computing application and develop a holistic implementation for large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and techniques ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. The result is an implementation that uses all available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability, making it future-proof for heterogeneous clusters of upcoming architectures on the exaFLOPS scale. Parallel efficiencies of more than 90% are achieved, leading to 2,604.72 GLUPS on 24,576 CPU cores and 2,048 GPUs of the CPU/GPU heterogeneous cluster Piz Daint while computing more than 6.8 · 10^9 lattice cells.

Today's large-scale HPC systems are based on various architectures. One example of such an architecture is given by heterogeneous clusters assembled from different types of computing devices (CPUs, GPUs, field programmable gate arrays (FPGAs), Intel Xeon Phi, accelerators such as the PEZY accelerator (https://www.pezy.co.jp/en/index.html), Google's tensor processing unit (TPU), etc.), node sizes (thin nodes, fat nodes), or network topologies (torus, fat tree, dragonfly, etc.). We focus on CPU/GPU heterogeneous clusters, in the following simply denoted as CPU/GPU clusters, whose nodes are uniformly equipped with at least one CPU and one GPU computing device. However, all discussed concepts generalize to heterogeneous systems consisting of alternative computing devices, too.

Large-scale CPU/GPU clusters provide huge computational performance, but successfully exploiting this potential introduces several challenges. Currently, there are two representatives of such systems in the top five of the fastest supercomputers (https://www.top500.org/list/2017/06/), and more systems of that kind are expected to evolve in the next years for energy and efficiency reasons.

CPU/GPU clusters introduce multiple levels of hardware parallelism. On the lower level, that is, at the level of the particular computing devices, CPUs contain multiple cores, each usually augmented with a SIMD unit, and a GPU is composed of multiprocessors, each containing several processing elements. On the higher level, multiple CPUs and GPUs, potentially located in different nodes of the cluster, have to be orchestrated to cooperatively perform the computational tasks. Facing this complexity, programming all these levels of parallelism in a scalable way is a non-trivial task. For the LBM, it requires tailored implementations of the LBM operators for CPUs and GPUs, a domain decomposition to evenly distribute work among the different computing devices, and a communication scheme that exchanges subdomain boundary data while hiding communication times behind computation.
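To make the communication-hiding idea concrete, the following C++/MPI sketch overlaps a non-blocking halo exchange with the update of the interior cells of a subdomain. It is a minimal illustration only: the helper routines (pack_boundary, update_interior, update_boundary), the 1D periodic neighborhood, and the buffer sizes are hypothetical placeholders and do not reflect the actual interfaces of our implementation.

    #include <mpi.h>
    #include <vector>

    // Hypothetical stand-ins for the real LBM operators (stubs for the sketch):
    static void pack_boundary(std::vector<double>& buf) { buf.assign(buf.size(), 0.0); }
    static void update_interior() { /* collide & stream all cells not adjacent to a halo */ }
    static void update_boundary(const std::vector<double>& halo) { (void)halo; /* remaining boundary cells */ }

    // One LBM time step on one rank: post the halo exchange, hide it behind the
    // interior update, and finish the boundary cells once the halo has arrived.
    static void lbm_time_step(MPI_Comm comm, int left, int right, int halo)
    {
        std::vector<double> send_l(halo), send_r(halo), recv_l(halo), recv_r(halo);
        MPI_Request reqs[4];

        // 1) Start the non-blocking exchange of the subdomain boundaries.
        MPI_Irecv(recv_l.data(), halo, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(recv_r.data(), halo, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        pack_boundary(send_l);
        pack_boundary(send_r);
        MPI_Isend(send_r.data(), halo, MPI_DOUBLE, right, 0, comm, &reqs[2]);
        MPI_Isend(send_l.data(), halo, MPI_DOUBLE, left,  1, comm, &reqs[3]);

        // 2) Update all interior cells while the messages are in flight.
        update_interior();

        // 3) Only the thin boundary layer has to wait for the incoming halo data.
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        update_boundary(recv_l);
        update_boundary(recv_r);
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        // Periodic 1D decomposition; the halo size is an arbitrary example value.
        const int left  = (rank - 1 + size) % size;
        const int right = (rank + 1) % size;
        for (int t = 0; t < 10; ++t)
            lbm_time_step(MPI_COMM_WORLD, left, right, 1024);
        MPI_Finalize();
        return 0;
    }

The essential point is the ordering: receives and sends are posted first, the bulk of the work is performed while the messages are in flight, and only the thin boundary layer has to wait for the halo data.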
In the following, a comprehensive set of best practices and techniques targeting the challenges of programming CPU/GPU clusters to achieve scalability of the LBM is presented, eventually leading to a hybrid reference implementation. There are already numerous successful attempts to implement the LBM on single multi- [1] and many-core processors [2-7], on small- [8] and large-scale clusters [9,10], and on heterogeneous systems consisting of single [11,12] and multiple nodes [13-15].

Many existing attempts for large-scale heterogeneous systems assign only communication tasks [14,15] or minor workloads [13] to the CPUs, underusing their computational performance. In contrast, we aim at fully exploiting all available computing devices of a large-scale CPU/GPU cluster at all times. Hence, we pursue an approach in which all kinds of computing devices perform the same actions, namely the operations of the LBM. We show that CPUs can contribute a considerable amount of performance in a heterogeneous scenario and that parallel efficiencies of more than 90% are achievable by simultaneously running on 2,048 GPUs and 24,576 CPU cores of the heterogeneous supercomputer Piz Daint, thus with an amount of resources on the petaFLOPS scale.

The holistic scalable implementation approach of the LBM for CPU/GPU heterogeneous clusters was first proposed by Riesinger in [16]. This paper builds on Riesinger's work, providing a compact presentation and focusing on the essential topics. Initial work on the hybrid reference code was done by Schreiber et al. [17], and we use these GPU kernels. Our work significantly extends this implementation by porting and parallelizing the kernels for the CPU and then utilizing multiple CPUs and GPUs concurrently. For a performance-efficient and scalable implementation, this requires simultaneous data copies, non-blocking communication, and communication hiding.

The remainder of this paper is structured as follows: In Section 2, related work on the LBM in the context of HPC is discussed and put into context with our holistic CPU/GPU approach. A brief introduction to the basic implementation concepts of the LBM is provided in Section 3. Section 4 presents the implementation techniques that lead to a scalable code fully exploiting CPU/GPU heterogeneous clusters, including the optimization and parallelization of the CPU and GPU kernels, the domain decomposition and resource assignment, and an efficient communication scheme. The performance of the particular LBM kernels, the advantage of a heterogeneous over a homogeneous approach, and the scalability of our implementation are evaluated in Section 5, followed by Section 6 with a brief conclusion and summary.

Related work

Besides multi-core CPUs [1], GPUs have been a target platform for the LBM for more than a decade, dating back to the pre-CUDA era [18,19]. For single-GPU setups, Tölke et al. [2] provide a first attempt using the D3Q13 discretization scheme. Applying the D3Q19 discretization scheme, Bailey et al. [3] experimentally validate that a GPU implementation can deliver performance superior to an optimized CPU implementation. Consequently, GPU-only frameworks coping with complex applications from engineering and science have also evolved over the last years; one example is ELBE [20]. The LBM is memory-bound, so many implementations aim for memory optimizations [7]. Obrecht et al. [5] achieve up to 86% of the theoretical peak memory bandwidth of a GPU with a tailored memory access pattern, and Rinaldi et al. [6] even exceed this value by performing all LBM operations in shared memory, which makes kernel fusion indispensable. Classical examples of hybrid programming incorporating more than one programming model are given by [21] and Rabenseifner et al. [22].
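As a rough sketch of this classical hybrid pattern, the following example combines MPI across ranks with OpenMP threads inside a rank; the routine collide_and_stream_cell and the chosen subdomain size are hypothetical placeholders and are not taken from [21,22] or from our reference code.

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    // Hypothetical per-cell update standing in for the D3Q19 collide/stream step.
    static inline void collide_and_stream_cell(std::vector<double>& f, long cell)
    {
        f[cell] = 0.9 * f[cell] + 0.1;  // placeholder arithmetic only
    }

    int main(int argc, char** argv)
    {
        // Request thread support, since OpenMP threads coexist with MPI calls.
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long local_cells = 1L << 20;         // arbitrary subdomain size
        std::vector<double> f(local_cells, 1.0);   // rank-local lattice data

        for (int t = 0; t < 10; ++t) {
            // Shared-memory level: all cores of the node work on the local subdomain.
            #pragma omp parallel for schedule(static)
            for (long cell = 0; cell < local_cells; ++cell)
                collide_and_stream_cell(f, cell);

            // Distributed-memory level: the boundary exchange between ranks would
            // go here (see the halo-exchange sketch above); a barrier stands in
            // for it in this sketch.
            MPI_Barrier(MPI_COMM_WORLD);
        }

        if (rank == 0)
            std::printf("%d MPI ranks x %d OpenMP threads per rank\n",
                        size, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }

In such a setup, one MPI rank is typically mapped to a node or socket and the OpenMP threads saturate its cores; the non-blocking halo exchange from the previous sketch would replace the placeholder barrier.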
These works use OpenMP for the shared-memory and MPI for the distributed-memory parallelization of arbitrary applications detached from the LBM, but they are limited to homogeneous CPU clusters. These hybrid programming techniques can also be utilized for the LBM, as shown by Linxweiler in [23]. A comparison of the efficiency of OpenMP combined with MPI against Unified Parallel C, a partitioned global address space (PGAS) language, for the LBM is provided in [24].

Moving on to homogeneous GPU clusters, an implementation of the LBM is reported by Obrecht et al. [8] for small-scale and by Wang et al. [9] for large-scale clusters. In the latter work, the authors overlap communication with computation, as we do in this work, too. Communication becomes more complex if multi-layered boundaries have to be exchanged between multiple GPUs, as occurs when using larger stencils such as the D2Q37 discretization scheme. Nonetheless, Calore et al. [10] show that such tasks can be handled efficiently on homogeneous GPU clusters. Since we restrict our considerations to the D3Q19 discretization scheme, we do not have to deal with multi-layered boundary exchange.

Proceeding to heterogeneous systems, Ye et al. [12] present an approach to assign work to the CPUs and the GPU of one node using OpenMP and CUDA. Calore et al. [25] deal not only with CPU/GPU heterogeneous nodes but also with CPU/Intel Xeon Phi heterogeneous installations. A more flexible, patch-based idea is proposed by Feichtinger et al. [11], implicitly introducing an adjustable load balancing mechanism. This mechanism requires management information in addition to the actual simulation data, which makes it more flexible but introduces information that has to be maintained globally. Hence, we refrain from global management information. Such implementations for single-node or small-scale heterogeneous systems are not limited to the LBM but can be extended to scenarios in which the LBM serves as the fluid solver of a fluid-structure interaction (FSI) application [26]; similar to our implementation, both computing devices execute the same functionality, and the heterogeneous setup performs better than a homogeneous one. Heterogeneous systems can even speed up the LBM in combination with adaptive mesh refinement (AMR). A massively parallel implementation of the LBM for large-scale CPU/GPU heterogeneous clusters is presented by Calore et al. [15] to carry out studies of convective turbulence. Yet, Calore et al. degrade the CPUs to communication devices, ignoring their computational performance. A more comprehensive solution is chosen by Shimokawabe et al. [13], who assign the boundary cells of a subdomain to the CPU and the inner cells to the GPU. This leads to a full utilization of the GPUs but only a minor utilization of the CPUs. We follow Shimokawabe's parallelization strategy which does not evenly
doi:10.3390/computation5040048