Topology agnostic hot-spot avoidance with InfiniBand

Abhinav Vishnu, Matthew Koop, Adam Moody, Amith Mamidala, Sundeep Narravula, Dhabaleswar K. Panda
2009, Concurrency and Computation: Practice and Experience
InfiniBand has become a very popular interconnect due to its advanced features and open standard. Large-scale InfiniBand clusters are becoming very popular, as reflected by the TOP500 supercomputer rankings. However, even with popular topologies such as the constant bisection bandwidth Fat Tree, hot-spots may occur in InfiniBand networks due to inappropriate configuration of network paths, the presence of other jobs in the network and the unavailability of adaptive routing. In this paper, we present a hot-spot avoidance layer (HSAL) for InfiniBand, which provides hot-spot avoidance using path bandwidth estimation and multi-pathing via the LMC mechanism, without taking the network topology into account. We propose an adaptive striping policy with a batch-based striping and sorting approach for efficient utilization of disjoint network paths. Integration of HSAL with MPI, the de facto programming model of clusters, shows promising results with collective communication primitives and MPI applications.

INTRODUCTION

Large-scale clusters increasingly rely on high-performance interconnects, InfiniBand [2] in particular, due to its advanced features and open standard. Parallel applications executing on these clusters primarily use MPI [3,4] as the de facto programming model. Fat Tree [5] has become a very popular interconnection topology for these clusters, primarily due to its multi-pathing capability. However, even with a constant bisection bandwidth (CBB) Fat Tree, hot-spot(s) may occur in the network depending upon the route configuration(s) between end nodes and the communication pattern(s) in an application. Other factors, including the presence of other jobs in the network and topology-unaware scheduling of tasks by program launchers, may significantly impact the performance of applications. To make matters worse, the deterministic routing nature of InfiniBand prevents an application from transparently using multiple paths to avoid the hot-spot(s) in the network. The InfiniBand specification [2] provides a congestion control protocol, which leverages an early congestion notification mechanism between switches and adapters. However, this approach enforces a reduced data transfer rate on the existing path, while other paths in the network may be left under-utilized.

A popular mechanism for providing hot-spot avoidance is to leverage multi-pathing. In our previous studies, we provided a framework for supporting multi-pathing at the end nodes [6], popularly referred to as multi-rail networks. We provided an abstraction for supporting multiple ports and multiple adapters and studied various scheduling policies. The evaluation at the MPI layer provided promising results with collective communication and applications. We expanded the basic ideas proposed in that study to provide hot-spot avoidance using the LMC mechanism [7]. Using the adaptive striping policy and the LMC mechanism, we studied the performance with collective communication primitives and MPI applications. We observed that HSAM (hot-spot avoidance with MVAPICH [8]) is an efficient mechanism for providing hot-spot avoidance.

In this study, we thoroughly review our previous proposals and alleviate the deficiencies of HSAM. The inherent limitations of the HSAM design prohibit the utilization of all the physically disjoint paths at run time. As a result, better paths may never even be explored. In this paper, we present a hot-spot avoidance layer (HSAL), which performs batch-based striping and sorting (BSS) during the application execution to adaptively eliminate the path(s) with low bandwidth. The design challenges for the integration of MPI with HSAL are also discussed in detail. We also compare MPI integrated with HSAL against the HSAM [7] scheme, evaluating different BSS configurations and the original case (no multi-pathing at all).
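The HSAL internals are not reproduced here; the C sketch below only illustrates the batch-based striping and sorting idea under stated assumptions. The type path_info_t and the hooks post_stripe() and wait_stripe() are hypothetical placeholders for the transport layer, and retiring exactly one path per batch is a simplification of the adaptive elimination described above.

```c
/*
 * Illustrative sketch of batch-based striping and sorting (BSS).
 * path_info_t, post_stripe() and wait_stripe() are hypothetical
 * placeholders, not the actual HSAL interface.
 */
#include <stdlib.h>
#include <sys/time.h>

typedef struct {
    int    path_id;      /* index into the set of LMC-derived paths */
    double bandwidth;    /* latest bandwidth estimate (bytes/s)     */
} path_info_t;

/* Stub transport hooks; a real implementation would post an RDMA
 * descriptor on the QP bound to this path and reap its completion. */
static void post_stripe(int path_id, const char *buf, size_t len)
{ (void) path_id; (void) buf; (void) len; }
static void wait_stripe(int path_id) { (void) path_id; }

/* qsort comparator: higher estimated bandwidth first. */
static int cmp_bw_desc(const void *a, const void *b)
{
    const path_info_t *pa = a, *pb = b;
    return (pa->bandwidth < pb->bandwidth) - (pa->bandwidth > pb->bandwidth);
}

/*
 * Stripe one large message across the active paths in batches.  After each
 * batch the paths are sorted by measured bandwidth and the slowest one is
 * retired, so later batches use only the better (less congested) paths.
 */
void bss_send(const char *buf, size_t len, path_info_t *paths, int npaths,
              int batch_size)
{
    size_t offset = 0;

    while (offset < len) {
        int used = (npaths < batch_size) ? npaths : batch_size;
        if (used < 1)
            used = 1;
        size_t chunk = (len - offset) / used;

        for (int i = 0; i < used && chunk > 0; i++) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            post_stripe(paths[i].path_id, buf + offset, chunk);
            wait_stripe(paths[i].path_id);   /* blocking, for simplicity */
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) +
                          (t1.tv_usec - t0.tv_usec) / 1e6;
            paths[i].bandwidth = (double) chunk / (secs > 0 ? secs : 1e-9);
            offset += chunk;
        }

        /* Re-sort by estimated bandwidth and drop the slowest path. */
        qsort(paths, (size_t) npaths, sizeof(path_info_t), cmp_bw_desc);
        if (npaths > 1)
            npaths--;
    }
}
```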
Using MPI_Alltoall, we can achieve an improvement of 27 and 32% in latency with different BSS configurations compared with the best configuration of the HSAM scheme on 32 and 64 processes, respectively. A default mapping of tasks in the cluster shows similar benefits. Using the Fourier transform benchmark from the NAS Parallel Benchmarks [9] with different problem sizes, the execution time can be improved by 5-7% with different BSS configurations compared with the best HSAM configuration, and by 11-13% compared with the original implementation. The other NAS Parallel Benchmarks [9] do not incur any performance degradation.

BACKGROUND

This section presents the background information for our study. To begin with, an introduction to InfiniBand is provided. This is followed by an introduction to MPI [3,4], including two-sided point-to-point communication and collective communication primitives.

Overview of InfiniBand

The InfiniBand architecture [2] defines a switched network fabric for interconnecting processing nodes and I/O nodes. An InfiniBand network may consist of switches, adapters (called Host Channel Adapters (HCAs)) and links for communication. InfiniBand supports different classes of transport services: reliable connection (RC), unreliable connection, reliable datagram and unreliable datagram. The RC transport mode supports remote direct memory access (RDMA), which makes it an attractive choice for designing communication protocols; we use this transport in this study. Under this model, each process pair creates a unique entity for communication, called a queue pair (QP). Each QP consists of two queues: a send queue and a receive queue. Requests to send data to the peer are placed on the send queue in the form of descriptors. A descriptor describes the information necessary for a particular operation; for an RDMA operation, it specifies the local buffer, the address of the peer buffer and the access rights for manipulation of the remote buffer. InfiniBand also provides a mechanism by which different queue pairs can share their receive queues, called the shared receive queue mechanism. Completions of descriptors are posted on a queue called the completion queue, which allows a sender to know the status of a data transfer operation. Different notification mechanisms are also supported (polling and asynchronous).

InfiniBand defines an entity called the subnet manager, which is responsible for the discovery, configuration and maintenance of a network. Each InfiniBand port in a network is identified by one or more local identifiers (LIDs), which are assigned by the subnet manager. As InfiniBand supports only destination-based routing for data transfer, each switch in the network has a routing table corresponding to the LID(s) of the destination. However, the deterministic routing nature of InfiniBand prevents intermediate switches from routing messages adaptively. To overcome this limitation, InfiniBand provides a mechanism, LID mask count (LMC), which can be used for specifying multiple paths between every pair of nodes in the network. The subnet manager may be configured with different values of the LMC (0-7), creating a maximum of 128 paths. Leveraging the LMC mechanism to avoid hot-spot(s) in the network is the focus of this paper.
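To make the LMC mechanism more concrete, the sketch below first enumerates the destination LIDs (DLIDs) that become available for a port when the subnet manager runs with a non-zero LMC, and then posts an RDMA write on a QP assumed to have been connected to one of those DLIDs earlier. Connection establishment, the memory registration and the remote_addr/rkey exchange are omitted and assumed to happen elsewhere; only the structures and the ibv_post_send() call follow the standard verbs API.

```c
/*
 * Sketch: enumerating LMC-derived destination LIDs and issuing an RDMA write
 * on a QP bound to one of them.  The qp, mr, remote_addr and rkey arguments
 * are assumed to be set up elsewhere; link with -libverbs.
 */
#include <infiniband/verbs.h>
#include <stdint.h>

/* With LID mask count (LMC) = lmc, the subnet manager assigns 2^lmc
 * consecutive LIDs to a port, starting at base_lid.  A connection set up to
 * each of these destination LIDs may be routed differently by the switches,
 * yielding up to 2^lmc (at most 128) potential paths. */
static int enumerate_paths(uint16_t base_lid, uint8_t lmc,
                           uint16_t dlids[], int max)
{
    int npaths = 1 << lmc;
    if (npaths > max)
        npaths = max;
    for (int i = 0; i < npaths; i++)
        dlids[i] = (uint16_t) (base_lid + i);  /* one DLID per potential path */
    return npaths;
}

/* Post an RDMA write of 'len' bytes on the QP previously connected to the
 * DLID of the chosen path.  'mr' is the registered local buffer. */
static int rdma_write_on_path(struct ibv_qp *qp, struct ibv_mr *mr,
                              void *local_buf, size_t len,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,
        .length = (uint32_t) len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,   /* completion posted to the CQ */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr); /* returns 0 on success */
}
```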
Overview of MPI protocols

The message passing interface (MPI) [3,4] defines multiple communication semantics. The two-sided communication semantics [3] have been widely used over the last decade for writing a majority of parallel applications. Two-sided communication semantics are broadly designed using the following protocols:

• Eager protocol: In the eager protocol, the sender process eagerly sends the entire message to the receiver. In order to achieve this, the receiver needs to provide sufficient buffers to handle incoming messages (a rough sketch of the sender-side decision is given below).
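The following sketch only illustrates the sender-side eager decision: small messages are copied into a pre-acquired eager buffer and handed to the transport immediately. EAGER_THRESHOLD, eager_buf_t, acquire_eager_buf(), transport_send() and rendezvous_send() are hypothetical names rather than MVAPICH internals, and the 8 KB threshold is only an assumed value; larger messages are commonly handled by a separate handshake-based protocol.

```c
/*
 * Illustrative sketch of an eager-protocol send path.  All names below are
 * hypothetical placeholders, not the actual MPI library internals.
 */
#include <stddef.h>
#include <string.h>

#define EAGER_THRESHOLD 8192   /* bytes; assumed protocol switch point */

typedef struct {
    char   data[EAGER_THRESHOLD];  /* pre-registered eager buffer */
    size_t len;
    int    dest;
} eager_buf_t;

/* Hypothetical hooks into the transport layer. */
extern eager_buf_t *acquire_eager_buf(void);
extern void         transport_send(int dest, const void *buf, size_t len);
extern void         rendezvous_send(int dest, const void *buf, size_t len);

/* Small messages are copied into an eager buffer and sent immediately; the
 * receiver must have posted enough buffers to absorb them.  Larger messages
 * fall back to a handshake-based (rendezvous-style) transfer, not shown. */
void two_sided_send(int dest, const void *user_buf, size_t len)
{
    if (len <= EAGER_THRESHOLD) {
        eager_buf_t *eb = acquire_eager_buf();
        memcpy(eb->data, user_buf, len);
        eb->len  = len;
        eb->dest = dest;
        transport_send(dest, eb->data, eb->len);
    } else {
        rendezvous_send(dest, user_buf, len);
    }
}
```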
doi:10.1002/cpe.1359