Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

Ioan Hadade, Luca di Mare
Computer Physics Communications, 2016
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data-parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied to two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices, including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques, together with optimisations aimed at alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assess their efficiency on each distinct architecture. We report significant speedups for single-thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

For many years, steady increases in clock frequency guaranteed improvements in serial performance with every new CPU generation and required limited code intervention. Those days are now gone, partly due to the recognition that clock frequency cannot be scaled indefinitely because of power consumption, and partly because circuitry density on the chip is approaching the limit of existing technologies, which is problematic as innovations in sequential execution require a high fraction of die real estate [3]. The current trend in CPU design is parallelism [4], which has led to the rise of multicore and manycore processors, effectively ending the so-called "free lunch" era [5,6] in performance scaling.
Modern multicore and manycore processors now consist of double-digit core counts integrated on the same die, vector units with associated instruction set extensions, multiple backend execution ports, and deeper and more complex memory hierarchies with features such as Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA), to name a few. Consequently, achieving any reasonable performance on these architectures mandates the exploitation of all architectural features and their intrinsic parallelism across all granularities (core, data and instruction) [6]. Core and data parallelism are exposed in modern applications through two principal mechanisms: threads and Single Instruction Multiple Data (SIMD) instructions.
doi:10.1016/j.cpc.2016.04.006