Portable and Scalable Algorithm for Irregular All-to-All Communication

Wenheng Liu, Cho-Li Wang, Viktor K. Prasanna
2002 Journal of Parallel and Distributed Computing  
In irregular all-to-all communication, messages are exchanged between every pair of processors. The message sizes vary from processor to processor and are known only at run time. This is a fundamental communication primitive in parallelizing irregularly structured scientific computations. Our algorithm reduces the total number of message start-ups. It also reduces node contention by smoothing out the lengths of the messages communicated. Compared with earlier approaches, our algorithm provides deterministic performance and also reduces the buffer space needed at the nodes during message passing. The performance of the algorithm is characterized using a simple communication model of high-performance computing (HPC) platforms. We show implementations on the T3D and the SP2 using C and the Message Passing Interface (MPI) standard; these can be easily ported to other HPC platforms. The results show the effectiveness of the proposed technique as well as the interplay among the machine size, the variance in message length, and the network interface. © 2002 Elsevier Science (USA). All rights reserved.
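As a concrete point of reference for the discussion that follows, the sketch below shows a minimal single-stage irregular all-to-all in C and MPI, where per-destination byte counts are known only at run time. The buffer names and the count-exchange step are illustrative assumptions for this sketch, not the paper's code.

```c
/* Minimal single-stage irregular all-to-all (illustrative sketch only).
 * Message sizes are known only at run time, so each node first exchanges
 * its per-destination byte counts, then calls MPI_Alltoallv.
 * Buffer names and the count-exchange step are assumptions for this sketch. */
#include <mpi.h>
#include <stdlib.h>

void irregular_alltoall(const char *sendbuf, const int *sendcounts,
                        char **recvbuf_out, int **recvcounts_out,
                        MPI_Comm comm)
{
    int P;
    MPI_Comm_size(comm, &P);

    /* Exchange the run-time message sizes so every node knows what to expect. */
    int *recvcounts = malloc(P * sizeof(int));
    MPI_Alltoall((void *)sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    /* Build displacement arrays from the counts. */
    int *sdispls = malloc(P * sizeof(int));
    int *rdispls = malloc(P * sizeof(int));
    int stotal = 0, rtotal = 0;
    for (int i = 0; i < P; i++) {
        sdispls[i] = stotal;  stotal += sendcounts[i];
        rdispls[i] = rtotal;  rtotal += recvcounts[i];
    }

    char *recvbuf = malloc(rtotal > 0 ? rtotal : 1);

    /* One message per destination: P-1 start-ups per node and no smoothing
     * of message lengths, which is what the staged algorithm is designed
     * to avoid. */
    MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, sdispls, MPI_CHAR,
                  recvbuf, recvcounts, rdispls, MPI_CHAR, comm);

    free(sdispls);
    free(rdispls);
    *recvbuf_out = recvbuf;
    *recvcounts_out = recvcounts;
}
```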
Parallelizing intermediate- and high-level vision problems, for example, requires irregular and data-dependent communication operations. A straightforward approach to performing irregular all-to-all communication is to send a message to each node, one by one. However, this simple approach is inefficient due to the large start-up latency and possible node contention. Start-up latency is the major source of overhead in message passing, particularly on loosely coupled platforms. The variance in message size causes node contention and thus results in the serialization of message start-ups. Several algorithms have been proposed to perform all-to-all communication [5, 16, 20, 24], motivated by particular topologies such as the mesh and the hypercube. Moreover, the variance of the message size is not considered in the design of these algorithms.

In this paper, we design and analyze an efficient algorithm for irregular all-to-all communication. The algorithm is suitable for state-of-the-art HPC platforms. Our algorithm consists of four stages. The first two stages perform message distribution, while the remaining two stages perform message collection. In each stage, the nodes are partitioned into several groups, and an all-to-all communication is performed within each group. At each node, some local memory accesses are performed to compose the messages to be communicated. The algorithm reduces the serialization of message passing start-ups (node contention) by balancing the lengths of the messages communicated in each stage. It also reduces the overall communication latency by reducing the number of message start-ups.

For comparison purposes, we estimate the performance of the algorithm based on a "flat" communication model. This model captures the features of the interconnection networks of HPC platforms in which the software overheads dominate the hardware latencies [22, 30]. In this model, let L_max denote the maximum traffic (in bytes) at a node, T_s the start-up overhead for the main processor to traverse the software layers to send and receive a message, t_d the data transfer time per byte, t_c the data copy latency per byte between local memory and the network interface, and t_m the latency of local memory access per byte. Given P nodes, the total time to perform irregular all-to-all communication can then be expressed in terms of these parameters.

The proposed techniques can, nonetheless, be exploited to optimize the performance of MPI primitives themselves. They can also be exploited at a lower level (the operating system and network interface level), using machine-specific features, to further improve the performance of communication primitives. Some platforms support a nonblocking communication mode. On these platforms, message composition and message communication can be performed in a pipelined fashion; we can therefore overlap communication with local memory access. The implementation of our approach is shown using such nonblocking primitives.
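To make the pipelining concrete, the following is a minimal sketch of overlapping message composition with communication using nonblocking MPI primitives, in the spirit of the overlap described above. The double-buffering scheme, the buffer sizes, and the helper compose_message() are illustrative assumptions, not the paper's implementation; matching receives (e.g., pre-posted MPI_Irecv calls) are omitted for brevity.

```c
/* Illustrative sketch: overlap message composition (local memory copies)
 * with nonblocking sends using double buffering. compose_message() is a
 * hypothetical placeholder for the gather/copy step that builds each
 * outgoing message from noncontiguous data. Receives are not shown. */
#include <mpi.h>

#define MAX_MSG_BYTES (1 << 20)   /* assumed upper bound on a staged message */

/* Hypothetical helper: copies the data destined for 'dest' into 'buf'
 * and returns the number of bytes composed. */
extern int compose_message(int dest, char *buf);

void staged_send_with_overlap(const int *dests, int ndests, MPI_Comm comm)
{
    static char bufs[2][MAX_MSG_BYTES];        /* two alternating pack buffers */
    MPI_Request req = MPI_REQUEST_NULL;

    for (int i = 0; i < ndests; i++) {
        char *cur = bufs[i & 1];

        /* Compose the next message while the previous MPI_Isend (if any)
         * is still in flight; this is the communication/memory overlap. */
        int nbytes = compose_message(dests[i], cur);

        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* previous buffer now reusable */

        MPI_Isend(cur, nbytes, MPI_CHAR, dests[i], /*tag=*/0, comm, &req);
    }

    if (req != MPI_REQUEST_NULL)
        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* drain the last send */
}
```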
These designs are advantageous in realizing throughput-oriented implementations when using HPC technology for image and signal processing applications.

To verify the effectiveness of our approach, we compare the performance of our algorithm against the straightforward single-stage approach (denoted A1) and two prior algorithms, the one in [2, 26] (denoted A2) and the one in [17] (denoted A3). All the algorithms were implemented on the SP2 and the T3D. The experimental results can be summarized as follows. Compared with A1 and A2, our algorithm reduces the number of messages communicated; this improves the performance when the blocking communication mode is used. Compared with A1 and A3, our algorithm reduces node contention as well as minimizes the buffer space needed at each node. Overlapping communication with local memory access using the nonblocking communication mode further improves the performance of our algorithm on the SP2.

The rest of the paper is organized as follows. Section 2 describes the latencies in performing irregular communication. Section 3 describes the algorithms and implementation details. Section 4 summarizes our experimental results on the T3D and SP2 and compares them with those obtained by earlier approaches. Section 5 concludes the paper.

2. IRREGULAR ALL-TO-ALL COMMUNICATION

In this section, we define a model for the cost of performing irregular all-to-all communication on state-of-the-art HPC platforms. Section 2.1 discusses a general communication system of HPC platforms and defines a "flat" communication model for performance analysis. Section 2.2 defines blocking and nonblocking communication modes. Section 2.3 models the communication time to perform irregular communication and identifies the latencies induced by node contention.

2.1. Communication Latencies

In state-of-the-art HPC platforms, the communication bottleneck is usually not the bandwidth of the network fabrics. Instead, latencies induced at the sender and the receiver cause the bottleneck [10, 18]. These include latencies introduced by message start-ups and memory copying. Figure 1 illustrates the steps in sending and receiving a message in a typical state-of-the-art HPC platform; m denotes the message size in bytes. The latency of a message passing operation is the total time spent by the message to traverse the communication path from the sender to the receiver. First, data stored in noncontiguous memory locations are coalesced into a contiguous buffer.

FIG. 1. Major steps in message passing in a typical state-of-the-art HPC platform.
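To illustrate the copy-then-send path that Fig. 1 depicts, the sketch below packs noncontiguous blocks into a contiguous buffer with MPI_Pack before issuing a single send; under the flat model sketched earlier, the packing copies correspond to the per-byte copy cost t_c, while the send incurs the start-up T_s plus t_d per byte. The block layout and function name are assumptions made for this example, not code from the paper.

```c
/* Illustrative sketch of the sender-side path in Fig. 1: noncontiguous
 * data are first coalesced (copied) into a contiguous buffer, then handed
 * to the communication layer as one message. The 'block_t' layout and the
 * function name are hypothetical; they are not taken from the paper. */
#include <mpi.h>

typedef struct {
    const char *addr;   /* start of a noncontiguous block in local memory */
    int         nbytes; /* length of the block in bytes */
} block_t;

/* Coalesce 'nblocks' noncontiguous blocks and send them to 'dest'.
 * The packing copies account for the per-byte copy cost (t_c); the send
 * itself incurs the start-up overhead (T_s) plus t_d per byte. */
void pack_and_send(const block_t *blocks, int nblocks, int dest,
                   char *packbuf, int packbuf_size, MPI_Comm comm)
{
    int position = 0;   /* running offset into the contiguous pack buffer */

    for (int i = 0; i < nblocks; i++) {
        MPI_Pack((void *)blocks[i].addr, blocks[i].nbytes, MPI_CHAR,
                 packbuf, packbuf_size, &position, comm);
    }

    /* One message start-up for the whole coalesced payload. */
    MPI_Send(packbuf, position, MPI_PACKED, dest, /*tag=*/0, comm);
}
```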
doi:10.1006/jpdc.2002.1862