Improving effective bandwidth through compiler enhancement of global cache reuse

Chen Ding, Ken Kennedy
2004 Journal of Parallel and Distributed Computing  
While CPU speed has improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications, primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data, then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. To carry out this strategy to its full extent, this research has developed a set of compiler transformations that perform computation fusion and data grouping over the whole program and during the entire execution. The major new techniques and their unique contributions are:
Maximal loop fusion: an algorithm that achieves maximal fusion among all program statements and bounded reuse distance within a fused loop.
Inter-array data regrouping: the first to selectively group global data structures and to do so with guaranteed profitability and compile-time optimality.
Locality grouping and dynamic packing: the first set of compiler-inserted and compiler-optimized computation and data transformations at run time.
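The two-step strategy described in the abstract (fuse computations on the same data, then group data used by the same computation) can be illustrated with a small hand-written sketch. This is not the authors' compiler algorithm: the function names (unfused, fused, regroup, fused_grouped) and the arrays a and b are hypothetical, and the transformation is applied by hand to a toy pair of loops. Python lists do not store values contiguously, so the sketch shows only the shape of the transformation; the bandwidth benefit would appear in a compiled language with contiguous arrays.

```python
# Illustrative sketch only; all names here are hypothetical.

def unfused(a, b):
    """Two separate passes: each a[i], b[i] is read twice, far apart in time."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):      # pass 1 reads all of a and b
        c[i] = a[i] + b[i]
    for i in range(n):      # pass 2 reads them again, long after pass 1
        d[i] = a[i] * b[i]
    return c, d


def fused(a, b):
    """Step 1, computation fusion: one pass, so each value is reused at once."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        x, y = a[i], b[i]   # loaded once, used by both computations
        c[i] = x + y
        d[i] = x * y
    return c, d


def regroup(a, b):
    """Step 2, data grouping: interleave a and b so that values used
    together sit next to each other (a[0], b[0], a[1], b[1], ...)."""
    ab = [0.0] * (2 * len(a))
    ab[0::2] = a
    ab[1::2] = b
    return ab


def fused_grouped(ab):
    """The fused loop running over the regrouped (interleaved) layout."""
    n = len(ab) // 2
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        x, y = ab[2 * i], ab[2 * i + 1]   # adjacent elements
        c[i] = x + y
        d[i] = x * y
    return c, d
```

All three versions compute the same results; only the order of accesses (fusion) and the placement of data (regrouping) change. In the unfused version the reuse of a[i] is separated by a full pass over both arrays, which is the far-separated reuse the abstract identifies as wasting bandwidth.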
doi:10.1016/j.jpdc.2003.09.005 fatcat:lt762atuijgefjrr4mqpm6q3wm