Program Analysis and Compiler Transformations for Computational Accelerators

Taylor Lloyd
2018
Heterogeneous computing is becoming increasingly common in high-end computer systems, with vendors often including compute accelerators such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) for increased throughput and power efficiency. This thesis addresses the usability and performance of compute accelerators, with an emphasis on compiler-driven analyses and transformations.

First, this thesis studies the challenge of programming for FPGAs. IBM and Intel both now produce systems with integrated FPGAs, but FPGA programming remains extremely challenging. To mitigate this difficulty, FPGA vendors now ship OpenCL-based High-Level Synthesis (HLS) tools, capable of generating Hardware Description Language (HDL) from Open Computing Language (OpenCL) source. Unfortunately, most OpenCL source today is written to be executed on GPUs and runs poorly on FPGAs. This thesis applies traditional compiler analyses and transformations to automatically restructure GPU-targeted OpenCL, achieving speedups of up to 6.7x over unmodified Rodinia OpenCL benchmarks written for GPUs.
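The abstract does not detail which transformations are applied; as a purely illustrative sketch of the gap they address, the contrast below shows a GPU-style NDRange OpenCL kernel alongside a single work-item form that FPGA HLS tools can typically pipeline more effectively. The kernel names and the specific rewrite are assumptions for illustration, not the thesis's actual transformations.

    // GPU-style NDRange kernel: one work-item per element (hypothetical example).
    __kernel void scale_ndrange(__global const float *in,
                                __global float *out,
                                const int n) {
        int gid = get_global_id(0);
        if (gid < n)
            out[gid] = 2.0f * in[gid];
    }

    // A single work-item form often favoured by FPGA HLS tools: the explicit loop
    // exposes a pipeline to the synthesis tool instead of emulating many work-items.
    __kernel void scale_swi(__global const float * restrict in,
                            __global float * restrict out,
                            const int n) {
        for (int i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];
    }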
Second, this thesis addresses the problem of automatically mapping OpenMP 4.x target regions to GPU hardware. In OpenMP, the compiler is responsible for determining the number and grouping of GPU threads, and the existing heuristic in LLVM/Clang performs poorly for a large subset of programs. We perform an exhaustive data collection over 23 OpenMP benchmarks from the SPEC ACCEL and Unibench suites. From this dataset, we propose a new grid-geometry heuristic that yields a 25% geometric-mean speedup over the geometries selected by the original LLVM/Clang heuristic.

The third contribution of this thesis concerns the performance of applications executing on GPUs. Such performance can be significantly degraded by irregular data accesses and by control-flow divergence. Both of these performance issues arise only in the presence of thread-divergent expressions: expressions that evaluate to different values for different threads. This thesis introduces GPUCheck, a static analysis tool that detects branch divergence and non-coalesceable memory accesses in GPU programs. GPUCheck relies on a static dataflow analysis to find thread-dependent expressions and on a novel symbolic analysis to determine when such expressions could lead to performance issues. Kernels taken from the Rodinia benchmark suite and repaired by GPUCheck execute up to 30% faster than the original kernels.

The fourth contribution of this thesis focuses on data transmission in a heterogeneous computing system. GPUs can be used as specialized accelerators to improve network connectivity. We present Run-Length Base-Delta (RLBD) encoding, a very high-speed compression format and algorithm capable of improving the throughput of 40GbE by up to 57% on datasets taken from the UCI Machine Learning Repository.
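To make the grid geometry of the second contribution concrete: in an OpenMP target region such as the hypothetical one below, the geometry is the number of teams and the number of threads per team launched on the GPU. When the num_teams and thread_limit clauses are omitted, as they commonly are, the LLVM/Clang heuristic chooses these values; the clauses appear here only to show what is being selected.

    #include <stdio.h>

    #define N (1 << 20)
    static float a[N];

    int main(void) {
        /* Hypothetical OpenMP 4.x target region. num_teams and thread_limit fix the
         * grid geometry explicitly; without them the compiler/runtime heuristic
         * decides how loop iterations are grouped into teams and threads. */
        #pragma omp target teams distribute parallel for \
                map(tofrom: a[0:N]) num_teams(128) thread_limit(256)
        for (int i = 0; i < N; ++i)
            a[i] = 2.0f * a[i];

        printf("a[0] = %f\n", a[0]);
        return 0;
    }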
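For the third contribution, the hypothetical OpenCL kernel below illustrates the two problem classes GPUCheck reports: a branch controlled by a thread-divergent expression and a strided, non-coalesceable memory access. The kernel is an assumption for illustration; it is not taken from Rodinia or from GPUCheck's output.

    __kernel void divergent_example(__global float *a, const int n) {
        int tid = get_global_id(0);
        if (tid >= n)
            return;

        // (tid & 1) is thread-divergent: it differs between adjacent work-items,
        // so this branch splits a SIMD group into two serialized paths.
        if (tid & 1)
            a[tid] += 1.0f;

        // Stride-8 access: neighbouring work-items touch addresses 32 bytes apart,
        // so the loads/stores cannot be coalesced into a single wide transaction.
        a[(tid * 8) % n] *= 2.0f;
    }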
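The abstract does not define the RLBD format itself; the sketch below shows only a generic base-delta block encoder, under the assumption that RLBD combines such blocks with run-length handling of repeated deltas. Names and layout are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    /* Generic base-delta block encoder (assumed structure, not the actual RLBD
     * layout): store one 32-bit base plus an 8-bit delta per value. Returns the
     * number of deltas written, or 0 if the block is not compressible this way.
     * Run-length encoding of repeated blocks/deltas is omitted. */
    static size_t encode_base_delta(const uint32_t *in, size_t n,
                                    uint32_t *base_out, int8_t *deltas_out) {
        if (n == 0)
            return 0;
        uint32_t base = in[0];
        for (size_t i = 0; i < n; ++i) {
            int64_t d = (int64_t)in[i] - (int64_t)base;
            if (d < -128 || d > 127)
                return 0;            /* values stray too far from the base */
            deltas_out[i] = (int8_t)d;
        }
        *base_out = base;
        return n;
    }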
doi:10.7939/r3z892x2m