Fast and Scalable Startup of MPI Programs in InfiniBand Clusters [chapter]

Weikuan Yu, Jiesheng Wu, Dhabaleswar K. Panda
2004 Lecture Notes in Computer Science  
Fast and scalable process startup is one of the major challenges in parallel computing over large-scale clusters. The startup of a parallel job can typically be divided into two phases: process initiation and connection setup. Both of these phases can become performance bottlenecks. In this paper, we characterize the startup of MPI programs in InfiniBand clusters and identify two startup scalability issues: serialized process initiation in the initiation phase and high communication overhead in the connection setup phase. We propose different approaches to reduce communication overhead and provide fast process initiation. Specifically, to reduce the connection setup time, we have developed one approach with data reassembly to reduce data volume and another with a bootstrap channel to parallelize the communication. Furthermore, we have exploited a process management framework, the Multi-purpose Daemons (MPD) system, to speed up the process initiation phase. The bootstrap channel is utilized to overcome the scalability limitations of MPD. Our experimental results show that job startup time has been improved by more than 4 times for 128-process jobs in an InfiniBand cluster. Scalability models derived from these results suggest that the improvement can be more than two orders of magnitude for the startup of 2048-process jobs.

This section provides an overview of MPI program startup in InfiniBand clusters and motivates the study of a scalable startup scheme. MVAPICH [12] is a high-performance implementation of MPI over InfiniBand. Its design is based on MPICH [6] and MVICH [11]. The current implementation of MVAPICH utilizes the Reliable Connection (RC) service for communication between processes.

Startup of MPI Applications using MVAPICH

The connection-oriented nature of IBA RC-based queue pairs (QPs) requires each process to create at least one QP for every peer process. To form a fully connected network of N processes, a parallel application needs to create and connect at least
doi:10.1007/978-3-540-30474-6_47 fatcat:lo3gch2tbbeihmh2ami2bywj6i