MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-based Systems

Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu-chun Feng, Keith R. Bisset, Rajeev Thakur
2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement frameworks, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers and balancing of communication based on accelerator and node architecture. We demonstrate the extensible design of MPI-ACC by using the popular CUDA and OpenCL accelerator programming interfaces. We examine the impact of MPI-ACC on communication performance and evaluate application-level benefits on a large-scale epidemiology simulation.

Keywords: MPI; GPU; CUDA; OpenCL; MPI-ACC

... of whether buffers reside in GPU memory, because MPI must inspect the location of every buffer at runtime using CUDA's cuPointerGetAttribute function. Depending on the result of this query, different code paths are executed to handle buffers residing in GPU or host memory. This query is expensive relative to extremely low-latency communication times and can add significant overhead to host-to-host communication operations. In Figure 2 we measure the impact of this query on the latency of intranode, CPU-to-CPU communication using MVAPICH v1.8 on the experimental platform described in Section V.
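The per-buffer location query described above can be sketched as follows. This is a minimal illustration using the CUDA driver API, not code from MPI-ACC or MVAPICH; the helper name mpi_buffer_on_gpu is hypothetical. It relies on the documented behavior that cuPointerGetAttribute with CU_POINTER_ATTRIBUTE_MEMORY_TYPE fails for ordinary (unregistered) host allocations, so a failed query is treated as host memory.

```c
/* Sketch of a runtime buffer-location check via the CUDA driver API.
 * Hypothetical helper for illustration only; requires a CUDA context
 * to have been initialized (cuInit / cuCtxCreate). */
#include <cuda.h>
#include <stdbool.h>
#include <stdint.h>

static bool mpi_buffer_on_gpu(const void *buf)
{
    CUmemorytype type;
    CUresult err = cuPointerGetAttribute(&type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)buf);
    /* Unregistered host pointers make the query fail
     * (CUDA_ERROR_INVALID_VALUE); treat any failure as host memory. */
    return (err == CUDA_SUCCESS && type == CU_MEMORYTYPE_DEVICE);
}
```

Because an MPI library supporting unified virtual addressing must run this check on every send and receive buffer, even pure host-to-host transfers pay for the driver call, which is the overhead Figure 2 quantifies.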
doi:10.1109/hpcc.2012.92