Optimizing matrix multiplication for a short-vector SIMD architecture – CELL processor

Jakub Kurzak, Wesley Alvaro, Jack Dongarra
Parallel Computing, 2009
Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision floating point performance, aside from special purpose accelerators like Graphics Processing Units (GPUs).
In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is a matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data (SIMD) architecture of the Synergistic Processing Element (SPE) of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C = C − A × B^T operation and the C = C − A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.

The current trend in processor design is towards chips with multiple processing units, commonly referred to as multi-core processors [3-5]. It has been postulated that the building blocks of future architectures are likely to be simple processing elements with shallow pipelines, in-order execution, and SIMD capabilities [6]. It has also been pointed out that direct control over the memory hierarchy may be desired, and software-managed scratchpad memory may be superior to traditional caches [6]. It can be observed that the Synergistic Processing Element of the CELL processor closely matches this description. There is no doubt that future processors will differ significantly from current designs and will reshape the way of thinking about programming such systems. By the same token, investigation into micro-kernel development for the SPE may have a broader impact by providing important insight into programming future multi-core architectures.

Performance considerations

State of the art numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [7,8]. Public domain libraries such as LAPACK [9] and ScaLAPACK [10] are good examples. These implementations work on square or rectangular submatrices in their inner loops, where operations are encapsulated in calls to Basic Linear Algebra Subroutines (BLAS) [11], with emphasis on expressing the computation as Level 3 BLAS, matrix-matrix type, operations. Frequently, the call is made directly to the matrix multiplication routine _GEMM. At the same time, all the other Level 3 BLAS can be defined in terms of _GEMM and a small amount of Level 1 and Level 2 BLAS [12]. A lot of effort has been invested in optimized BLAS by hardware vendors as well as academic institutions through projects such as ATLAS [13] and GotoBLAS [14]. At the same time, the inefficiencies of the BLAS layer have been pointed out [15], as well as the shortcomings of its fork-join parallelization model [16]. Owing to this, the emerging trend in linear algebra is towards the use of specialized data structures such as Block Data Layout (BDL) [17,18] and the expression of algorithms directly in terms of specialized inner-kernels [19]. Although the application of these techniques is not always straightforward, problems can often be remedied by novel algorithmic approaches [20,21]. The innovation in CELL software has been progressing faster than elsewhere, with direct use of inner-kernels, out-of-order execution and Block Data Layout being common practice [22-24]. As a result, the performance of algorithms comes much closer to the speed of _GEMM for much smaller problem sizes [24].
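To make the Block Data Layout idea concrete, the following is a minimal repacking sketch in plain C. It is not code from the paper; the function name, the row-major tile ordering, and the 64 × 64 block size (chosen to match the kernel tile size above) are illustrative assumptions.

#include <stddef.h>

#define NB 64  /* block size; matches the 64 x 64 kernel tiles */

/* Hypothetical helper: repack a row-major n x n matrix (n divisible
 * by NB) into Block Data Layout, where each NB x NB tile is stored
 * contiguously and tiles follow one another in row-major tile order.
 * A kernel (or a DMA transfer to an SPE's local store) can then touch
 * one dense, aligned tile at a time instead of strided matrix rows. */
static void to_block_layout(float *dst, const float *src, size_t n)
{
    size_t nt = n / NB;                      /* tiles per dimension */
    for (size_t bi = 0; bi < nt; bi++)       /* tile row            */
        for (size_t bj = 0; bj < nt; bj++)   /* tile column         */
            for (size_t i = 0; i < NB; i++)  /* row within tile     */
                for (size_t j = 0; j < NB; j++)
                    dst[((bi * nt + bj) * NB + i) * NB + j] =
                        src[(bi * NB + i) * n + (bj * NB + j)];
}

With tiles stored contiguously, each 64 × 64 single precision tile occupies exactly 16 kB of contiguous memory, a convenient unit for a single DMA transfer into an SPE's local store.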
Any improvement to the _GEMM routine immediately benefits the entire algorithm, which makes optimization of the _GEMM routine all the more important on the CELL processor.
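As a point of reference, the operation computed by the paper's 64 × 64 kernel, C = C − A × B^T, can be written as the following plain scalar C routine. This is only a behavioral sketch under the assumption of row-major, densely packed tiles; the actual SPE kernel described in the paper is hand-vectorized and software-pipelined, and bears no structural resemblance to this loop nest.

#include <stddef.h>

#define N 64  /* tile size of the paper's kernels */

/* Behavioral reference for the C = C - A x B^T update on a single
 * 64 x 64 single precision tile. Since (A x B^T)[i][j] equals
 * sum over k of A[i][k] * B[j][k], the inner loop walks row i of A
 * against row j of B. */
static void sgemm_nt_ref(float *C, const float *A, const float *B)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += A[i * N + k] * B[j * N + k];
            C[i * N + j] -= acc;
        }
}

On the SPE, these accumulations would instead proceed four elements at a time on 128-bit vector registers using fused multiply-add/subtract intrinsics, which is what makes a figure such as 99.80% of peak attainable.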
doi:10.1016/j.parco.2008.12.010