Static Compilation Analysis for Host-Accelerator Communication Optimization [chapter]

Mehdi Amini, Fabien Coelho, François Irigoin, Ronan Keryell
2013 Lecture Notes in Computer Science  
We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled effectively at compile-time. We present experimental results obtained with Polybench 2.0, some Rodinia benchmarks, and a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naïve parallelization using a modern GPU with Par4All, HMPP, and PGI, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.
doi:10.1007/978-3-642-36036-7_16 fatcat:jpftk6kotjbgtnq2pux3ep7wly
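
The effect of avoiding redundant communications between kernel executions can be illustrated with a small counting model. This is only a sketch under stated assumptions, not the analysis implemented in PIPS/Par4All: the chapter describes a static, compile-time transformation, whereas the model below simply walks a straight-line program and tracks which side holds an up-to-date copy of each array. The names `Kernel`, `HostUse`, `naive_transfers`, `optimized_transfers`, and the example `program` are all hypothetical, and the model does not capture where in the code the transfers are scheduled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical accelerator kernel: the arrays it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


@dataclass(frozen=True)
class HostUse:
    """Hypothetical host-side statement that reads some arrays."""
    reads: frozenset


def naive_transfers(program):
    """Count copies when every kernel copies its inputs in and its outputs back."""
    count = 0
    for step in program:
        if isinstance(step, Kernel):
            count += len(step.reads) + len(step.writes)
    return count


def optimized_transfers(program, host_arrays):
    """Count copies when a transfer is issued only if the target copy is stale.

    Assumption: every array in host_arrays starts with a valid host copy,
    and the host never writes an array after the program starts.
    """
    on_host = set(host_arrays)   # arrays whose host copy is up to date
    on_device = set()            # arrays whose device copy is up to date
    count = 0
    for step in program:
        if isinstance(step, Kernel):
            for array in step.reads:
                if array not in on_device:      # host-to-device copy needed
                    count += 1
                    on_device.add(array)
            for array in step.writes:
                on_device.add(array)            # produced on the device
                on_host.discard(array)          # host copy is now stale
        else:
            for array in step.reads:
                if array not in on_host:        # device-to-host copy needed
                    count += 1
                    on_host.add(array)
    return count


# Three chained kernels that reuse each other's results, then one host read.
program = [
    Kernel("k1", frozenset({"A"}), frozenset({"B"})),
    Kernel("k2", frozenset({"B"}), frozenset({"C"})),
    Kernel("k3", frozenset({"A", "C"}), frozenset({"D"})),
    HostUse(frozenset({"D"})),
]

if __name__ == "__main__":
    print("naive:", naive_transfers(program))
    print("optimized:", optimized_transfers(program, {"A"}))
```

For this example program the naive scheme counts 7 transfers, while tracking data residency leaves 2: one copy of `A` to the accelerator before the first kernel and one copy of `D` back when the host finally reads it. The intermediate arrays `B` and `C` never leave the accelerator, which is the kind of reuse the abstract refers to.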