CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator
Journal of Algorithms & Computational Technology
Modern GPUs open a new field for optimizing embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy, whose organization and performance implications are radically different from those of general-purpose CPUs; and optimizing programs at the instruction level for the GPU. This paper makes two contributions to performance optimization for GPUs. We analyze different approaches for optimizing memory usage and access patterns on GPUs and propose a class of memory layout optimizations that take full advantage of the unique memory hierarchy of NVIDIA CUDA. Furthermore, we analyze the performance increase obtained by fully unrolling the innermost loop of the algorithm and propose guidelines on how best to unroll a program for the GPU. In particular, even though loop unrolling is a common optimization, the performance improvement on a GPU derives from a completely different aspect of this architecture. To demonstrate these optimizations, we chose an embarrassingly parallel algorithm used to calculate gravitational forces. This algorithm allows us to demonstrate and explain the performance gains of the applied optimizations. Our results show that our approach is effective: after applying our techniques to the algorithm used in the Gravit gravity simulator, we observed a 1.27x speedup over the baseline GPU implementation, which corresponds to an 87x speedup over the original CPU implementation.