Improving MPI Applications Performance on Multicore Clusters with Rank Reordering [chapter]

Guillaume Mercier, Emmanuel Jeannot
2011 Lecture Notes in Computer Science  
Modern hardware architectures featuring multicores and a complex memory hierarchy raise challenges that need to be addressed by parallel application programmers. It is therefore tempting to adapt an application's communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2
... n of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks so that the application communication pattern matches the hardware topology. Experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric.

2 Matching a communication pattern to the hardware architecture: issues and techniques

General overview of the problem

During the execution of an MPI application, data are exchanged among the participating processes. The MPI programming paradigm is flat: each process may communicate with any other process in the application. However, the amount of data sent and received (in terms of either bytes, i.e. volume, or number of messages) may vary from one pair of processes to another. Hence, each MPI application possesses a so-called communication pattern, which can be considered an intrinsic characteristic [4] of the affinity between processes (here, we assume that this pattern is deterministic and does not change between executions).

On the other hand, the communication channels in a cluster of multicore NUMA nodes are heterogeneous. Internode communication over a network is slower than intranode communication through shared memory. The novelty with multicore NUMA nodes is that communication performance is also heterogeneous within the node itself, because of the various levels of cache memory and the NUMA effects incurred when accessing main memory. It is therefore rather intuitive to seek to adapt a potentially irregular communication pattern to the underlying hardware architecture, whose performance is heterogeneous as well.
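The matching problem described above can be illustrated with a small sketch. It is not the method used in the paper: it is a brute-force search over a toy instance, and the communication matrix `COMM`, the two-node layout and the per-byte costs in `core_cost` are all hypothetical values chosen for illustration. Only two hardware levels (shared memory and network) are modelled; cache and NUMA levels could be added as further cost tiers.

```python
from itertools import permutations

# Hypothetical communication pattern: COMM[i][j] is the number of bytes
# exchanged between ranks i and j (symmetric, assumed deterministic).
COMM = [
    [0, 10, 1000, 10],
    [10, 0, 10, 1000],
    [1000, 10, 0, 10],
    [10, 1000, 10, 0],
]

def core_cost(a, b):
    """Hypothetical per-byte cost between two cores.

    Cores 0-1 sit on one node and cores 2-3 on another:
    shared memory costs 1 per byte, the network costs 10 per byte.
    """
    if a == b:
        return 0
    return 1 if a // 2 == b // 2 else 10

def mapping_cost(mapping):
    """Total cost of a placement, where mapping[rank] is the core used."""
    n = len(mapping)
    return sum(
        COMM[i][j] * core_cost(mapping[i], mapping[j])
        for i in range(n)
        for j in range(i + 1, n)
    )

def best_mapping(n):
    """Exhaustive search; only feasible for very small process counts."""
    return min(permutations(range(n)), key=mapping_cost)

if __name__ == "__main__":
    identity = (0, 1, 2, 3)
    best = best_mapping(4)
    print("identity:", identity, mapping_cost(identity))
    print("best:    ", best, mapping_cost(best))
```

In this toy instance the default placement puts both heavily communicating pairs (ranks 0 and 2, ranks 1 and 3) on different nodes, for a cost of 20220. Placing each pair on the same node lowers the cost to 2400. The exhaustive search grows factorially with the number of processes, so a practical reordering needs a heuristic instead.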
doi:10.1007/978-3-642-24449-0_7 fatcat:sfbtayyzfbbkho563zqn3a3kfy