The Importance of Non-Data-Communication Overheads in MPI

P. Balaji, A. Chan, W. Gropp, R. Thakur, E. Lusk
2010, The International Journal of High Performance Computing Applications
With processor speeds no longer doubling every 18-24 months owing to the exponential increase in power consumption and heat dissipation, modern HEC systems tend to rely less on the performance of single processing units. Instead, they achieve high performance through the parallelism of a massive number of low-frequency/low-power processing cores. Using such low-frequency cores, however, puts a premium on the end-host pre- and post-communication processing required within communication
stacks, such as the message passing interface (MPI) implementation. Similarly, small amounts of serialization within the communication stack that were acceptable on small and medium systems can be brutal on massively parallel systems. Thus, in this paper, we study the different non-data-communication overheads within the MPI implementation on the IBM Blue Gene/P system. Specifically, we analyze various aspects of MPI, including the overhead of the MPI stack itself, the overhead of allocating and queueing requests, queue searches within the MPI stack, multi-request operations, and various others. Our experiments, which scale up to 131,072 cores of the largest Blue Gene/P system in the world (80% of the total system size), reveal several insights into overheads in the MPI stack that were previously not considered significant but can have a substantial impact on such massive systems.

BG/P Hardware and Software Stacks

In this section, we present a brief overview of the hardware and software stacks on BG/P.

Blue Gene/P Communication Hardware

BG/P has five different networks [12]. Two of them, 10-Gigabit Ethernet and 1-Gigabit Ethernet with a JTAG interface, are used for file I/O and system management. The other three networks, described below, are used for MPI communication.

3-D Torus Network: This network is used for MPI point-to-point and multicast operations and connects all compute nodes to form a 3-D torus; thus, each node has six nearest neighbors. Each link provides a bandwidth of 425 MBps per direction (6 links × 2 directions × 425 MBps = 5.1 GBps), for a total of 5.1 GBps bidirectional bandwidth per node.

Global Collective Network: This is a one-to-all network spanning compute and I/O nodes, used for MPI collective communication and I/O services. Each node has three links to the collective network, for a total of 5.1 GBps bidirectional bandwidth (850 MBps per direction per link).

Global Interrupt Network: This is an extremely low-latency network for global barriers and interrupts. For example, the global barrier latency for a 72K-node partition is approximately 1.3 µs.

On BG/P, the compute cores do not handle packets on the torus network. Instead, a DMA engine on each compute node offloads most of the work of injecting and receiving network packets, which enables better overlap of computation and communication. The DMA engine interfaces directly with the torus network. The cores, however, do handle the sending and receiving of packets on the collective network.
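To make the overlap point concrete, the sketch below is our illustration rather than code from the paper: a rank posts non-blocking transfers to its ring neighbors and computes on independent data while the transfers are in flight, the pattern that a torus DMA engine can service without involving the cores. The buffer length N and the compute_on routine are placeholder assumptions.

/* Illustrative sketch: overlap of computation with non-blocking MPI
 * communication.  N and compute_on() are placeholders. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

static void compute_on(double *data, int n)
{
    /* Stand-in for application work that does not touch the message buffers. */
    for (int i = 0; i < n; i++)
        data[i] = data[i] * 2.0 + 1.0;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    double *work    = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = i; }

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Post the transfers; an offload engine can move the data ...       */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* ... while the core performs independent computation.              */
    compute_on(work, N);

    /* Completion processing inside MPI_Waitall is itself one of the
     * non-data-communication costs the paper examines.                  */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}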
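The request-allocation and queue-search overheads listed in the abstract can likewise be exposed by a simple micro-benchmark. The following sketch is ours, not the paper's benchmark code; DEPTH, ITERS, and the tag values are arbitrary choices. It lengthens rank 0's posted-receive queue with receives that never match, so every incoming message must be compared against those entries during matching; comparing against a run with DEPTH set to 0 separates the queue-traversal cost from the raw message latency.

/* Illustrative micro-benchmark sketch: exposing posted-receive queue-search
 * overhead.  Run with at least two processes; DEPTH and ITERS are arbitrary. */
#include <mpi.h>
#include <stdio.h>

#define DEPTH 1024   /* non-matching receives queued ahead of the timed one */
#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, size;
    char decoy_buf[DEPTH];
    char payload = 0;
    MPI_Request decoys[DEPTH];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    if (rank == 0) {
        /* Decoy receives with tags rank 1 never sends; each incoming message
         * is checked against these queue entries while being matched. */
        for (int i = 0; i < DEPTH; i++)
            MPI_Irecv(&decoy_buf[i], 1, MPI_CHAR, 1, 1000 + i,
                      MPI_COMM_WORLD, &decoys[i]);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Recv(&payload, 1, MPI_CHAR, 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();
        printf("avg receive time with %d queued requests: %.3f us\n",
               DEPTH, 1e6 * (t1 - t0) / ITERS);

        /* Cancel and complete the decoy requests. */
        for (int i = 0; i < DEPTH; i++)
            MPI_Cancel(&decoys[i]);
        MPI_Waitall(DEPTH, decoys, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int i = 0; i < ITERS; i++)
            MPI_Send(&payload, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}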
doi:10.1177/1094342009359258