Improving software pipelining with hardware support for self-spatial loads

Steve Carr, Philip Sweany
1999 SIGARCH Computer Architecture News  
Recent work in software pipelining in the presence of uncertain memory latencies has shown that using compiler-generated cache-reuse analysis to determine proper load latencies can improve performance significantly [14, 19, 9]. Even with reuse information, references with a stride-one access pattern in the cache (called self-spatial loads) have been treated as all cache hits or all cache misses rather than as a single cache miss followed by a few cache hits in the rest of the cache line. In ... paper, we show how hardware support for loading two consecutive cache lines with one instruction (called a prefetching load), when directed by the compiler, can significantly improve software pipelining for scientific program loops. On a set of 79 Fortran loops, using prefetching loads gave an average performance improvement of 7% over assuming all self-spatial loads are cache misses (assuming all hits often gives worse performance than assuming all misses [14]). In addition, prefetching loads reduced floating-point register pressure by 31% and integer register pressure by 20%. As a result, we were able to software pipeline 31% more loops within modern register constraints (32 integer/32 floating-point registers) with prefetching loads. These results show that specialized prefetching load instructions have considerable potential to improve software pipelining for array-based scientific codes.
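The hit/miss pattern of a self-spatial load, and the effect of fetching the next line along with the current one, can be illustrated with a toy model. The sketch below is not the authors' simulator or experimental setup: the function name `hit_miss_pattern`, the line size of 4 elements, and the cache with no evictions and no timing are all assumptions made for illustration. It only records whether the line for each stride-one access is already resident.

```python
def hit_miss_pattern(num_elements, elements_per_line, prefetching_load=False):
    """Return one 'M' (miss) or 'H' (hit) per access of a stride-one array walk.

    Toy model (assumption): unbounded cache, no evictions, no latency.
    With prefetching_load=True, every load also brings in the next line.
    """
    resident = set()  # indices of cache lines already brought in
    pattern = []
    for i in range(num_elements):
        line = i // elements_per_line
        pattern.append("H" if line in resident else "M")
        resident.add(line)          # the accessed line is now resident
        if prefetching_load:
            resident.add(line + 1)  # the following line is fetched as well
    return "".join(pattern)


plain = hit_miss_pattern(12, 4)
prefetched = hit_miss_pattern(12, 4, prefetching_load=True)
print(plain)       # MHHHMHHHMHHH
print(prefetched)  # MHHHHHHHHHHH
```

In this model an ordinary self-spatial load misses once per cache line, which is why treating it as all hits or all misses is inaccurate. With the hypothetical prefetching load, only the first access misses, so a scheduler could plausibly assign the hit latency to the remaining iterations.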
doi:10.1145/309758.309784 fatcat:icxaotwhbnbu7i6lfb65ubr3em