Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments

John Jenkins, James Dinan, Pavan Balaji, Nagiza F. Samatova, Rajeev Thakur
2012 IEEE International Conference on Cluster Computing
The lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges the GPU acceleration of large-scale scientific and engineering computations. A particular challenge is the efficient transfer of noncontiguous data to and from GPU memory. MPI supports such transfers through the use of datatypes; however, an efficient means of utilizing datatypes for noncontiguous data in GPU memory is not currently known. To address this gap, we present the design and implementation of an efficient MPI datatype-processing system, which is capable of processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-tractable format that exposes parallelism. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead while enabling the packing of datatypes that do not have a direct CUDA equivalent. These improvements translate into significant reductions in end-to-end, GPU-to-GPU communication time.
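As a rough illustration of the kind of transfer the abstract describes (and not the authors' datatype-processing system), the sketch below describes a noncontiguous column of a row-major matrix residing in GPU memory in two ways: once as a standard MPI vector datatype handed to a (CUDA-aware) MPI library, and once packed in-device by a hand-written CUDA kernel using one thread per element before a contiguous send. The kernel name pack_column and the matrix dimensions are assumptions made for this example.

```c
/* Illustrative sketch only: one noncontiguous column of an N x M row-major
 * matrix in GPU memory, transferred via (a) an MPI vector datatype and
 * (b) manual in-device packing followed by a contiguous send.
 * Requires a CUDA-aware MPI library to accept device pointers. */
#include <mpi.h>
#include <cuda_runtime.h>

#define N 1024   /* rows    */
#define M 1024   /* columns */

/* One thread per element: gather column `col` of a row-major matrix into a
 * contiguous staging buffer (element-level parallelism for one fixed layout). */
__global__ void pack_column(const double *matrix, double *packed, int col)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N)
        packed[row] = matrix[row * M + col];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_matrix, *d_packed;
    cudaMalloc((void **)&d_matrix, (size_t)N * M * sizeof(double));
    cudaMalloc((void **)&d_packed, (size_t)N * sizeof(double));

    /* MPI description of the same column: N blocks of 1 double,
     * strided M elements apart. */
    MPI_Datatype column;
    MPI_Type_vector(N, 1, M, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        /* (a) Let the MPI library process the datatype; whether packing
         *     happens on the device is implementation-dependent. */
        MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);

        /* (b) Pack in-device ourselves, then send a contiguous buffer. */
        pack_column<<<(N + 255) / 256, 256>>>(d_matrix, d_packed, 0);
        cudaDeviceSynchronize();
        MPI_Send(d_packed, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(d_packed, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    cudaFree(d_matrix);
    cudaFree(d_packed);
    MPI_Finalize();
    return 0;
}
```

The hand-written kernel in (b) is fast but layout-specific; the paper's contribution is to obtain comparable in-device packing for arbitrary MPI datatypes, including layouts with no direct CUDA-copy equivalent.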
doi:10.1109/cluster.2012.72 dblp:conf/cluster/JenkinsDBST12