An Experimental Study of Self-Optimizing Dense Linear Algebra Software
Proceedings of the IEEE
Analytical models of the memory hierarchy are used to explain the performance of self-optimizing software.
ABSTRACT | Memory hierarchy optimizations have been studied by researchers in many areas including compilers, numerical linear algebra, and theoretical computer science. However, the approaches taken by these communities are very different. The compiler community has invested considerable effort in inventing loop transformations like loop permutation and tiling, and in the development of
... mple analytical models to determine the values of numerical parameters such as tile sizes required by these transformations. Although the performance of compiler-generated code has improved steadily over the years, it is difficult to retarget restructuring compilers to new platforms because of the need to develop analytical models manually for new platforms. The search for performance portability has led to the development of self-optimizing software systems. One approach to self-optimizing software is the generate-and-test approach, which has been used by the dense numerical linear algebra community to produce high-performance BLAS and fast Fourier transform libraries. Another approach to portable memory hierarchy optimization is to use the divide-and-conquer approach to implementing cache-oblivious algorithms. Each step of divide-and-conquer generates problems of smaller size. When the working set of the subproblems fits in some level of the memory hierarchy, that subproblem can be executed without capacity misses at that level. Although all three approaches have been studied extensively, there are few experimental studies that have compared these approaches. How well does the code produced by current self-optimizing systems perform compared to hand-tuned code? Is empirical search essential to the generate-and-test approach or is it possible to use analytical models with platform-specific parameters to reduce the size of the search space? The cache-oblivious approach uses divide-and-conquer to perform approximate blocking; how well does approximate blocking perform compared to precise blocking? This paper addresses such questions for matrix multiplication, which is the most important dense linear algebra kernel.
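The distinction between precise and approximate blocking can be illustrated with a minimal sketch. It is not the code studied in the paper. The function names, the tile-size parameter `nb`, and the recursion threshold `cutoff` are hypothetical, and pure-Python lists are used only to keep the example self-contained. `matmul_tiled` blocks the loops with an explicit tile size, the parameter that an analytical model or an empirical search would have to choose. `matmul_recursive` halves the largest dimension at each step and never refers to a cache size.

```python
def matmul_tiled(A, B, nb):
    """Precise blocking: tile all three loops with an explicit tile size nb."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # Multiply one nb x nb tile of A by one nb x nb tile of B.
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def matmul_recursive(A, B, C, i0, j0, k0, m, n, p, cutoff=2):
    """Approximate blocking by divide-and-conquer.

    Accumulates the product of the m x p block of A at (i0, k0) and the
    p x n block of B at (k0, j0) into the m x n block of C at (i0, j0).
    Assumes cutoff >= 1.
    """
    if max(m, n, p) <= cutoff:
        # Base case: a small problem handled by a plain triple loop.
        for i in range(i0, i0 + m):
            for j in range(j0, j0 + n):
                acc = C[i][j]
                for k in range(k0, k0 + p):
                    acc += A[i][k] * B[k][j]
                C[i][j] = acc
    elif m >= n and m >= p:
        h = m // 2  # split the rows of A and C
        matmul_recursive(A, B, C, i0, j0, k0, h, n, p, cutoff)
        matmul_recursive(A, B, C, i0 + h, j0, k0, m - h, n, p, cutoff)
    elif n >= p:
        h = n // 2  # split the columns of B and C
        matmul_recursive(A, B, C, i0, j0, k0, m, h, p, cutoff)
        matmul_recursive(A, B, C, i0, j0 + h, k0, m, n - h, p, cutoff)
    else:
        h = p // 2  # split the shared dimension; both halves update the same C
        matmul_recursive(A, B, C, i0, j0, k0, m, n, h, cutoff)
        matmul_recursive(A, B, C, i0, j0, k0 + h, m, n, p - h, cutoff)
```

In the tiled version the quality of the blocking depends entirely on the choice of `nb`. In the recursive version the subproblems shrink until their working set fits in whatever cache level is present, but their sizes are only powers-of-two fractions of the original problem, which is why the blocking is approximate.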