Effective automatic computation placement and data allocation for parallelization of regular programs
Proceedings of the 28th ACM international conference on Supercomputing - ICS '14
This paper proposes techniques for data allocation and computation mapping when compiling affine loop nest sequences for distributed-memory clusters. Techniques for transformation and detection of parallelism, and for generation of communication sets relying on the polyhedral framework, already exist. However, these recent approaches used a simple strategy to map computation to nodes, typically block or block-cyclic. These mappings may lead to excess communication volume across multiple loop nests. In addition, the data allocation strategy used did not permit efficient weak scaling. We address these complementary problems by proposing automatic techniques to determine computation placements for identified parallelism and allocation of data. Our approach for data allocation is driven by tiling of data spaces, along with a scheme to allocate and deallocate tiles on demand and reuse them. We show that our approach for computation mapping yields more effective mappings than those that can be developed using vendor-supplied libraries. Experimental results on some sequences of BLAS calls demonstrate a mean speedup of 1.82× over versions written with ScaLAPACK. Besides enabling weak scaling for distributed memory, data tiling also improves locality for shared-memory parallelization. Experimental results on a 32-core shared-memory SMP system show a mean speedup of 2.67× over code that is not data tiled.
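The on-demand tile allocation and reuse scheme the abstract describes can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the names (`TilePool`, `TiledArray`, `drop_tile`) and the flat-list tile buffers are assumptions made for clarity. The idea shown is that the data space is partitioned into fixed-size tiles, a tile's buffer is allocated only when the tile is first touched, and buffers freed at the end of a tile's live range return to a pool for reuse, keeping the memory footprint proportional to the set of live tiles rather than the whole array.

```python
class TilePool:
    """Allocates fixed-size tile buffers on demand and recycles freed ones."""

    def __init__(self, tile_elems):
        self.tile_elems = tile_elems
        self.free = []        # buffers released and awaiting reuse
        self.allocations = 0  # count of fresh (non-recycled) allocations

    def acquire(self):
        if self.free:
            return self.free.pop()  # reuse a previously freed buffer
        self.allocations += 1
        return [0.0] * self.tile_elems

    def release(self, buf):
        self.free.append(buf)


class TiledArray:
    """A 2D array whose storage is per-tile buffers, allocated lazily."""

    def __init__(self, tile, pool):
        self.tile = tile      # tile side length
        self.pool = pool
        self.tiles = {}       # (tile_i, tile_j) -> buffer

    def _locate(self, i, j):
        key = (i // self.tile, j // self.tile)
        if key not in self.tiles:
            self.tiles[key] = self.pool.acquire()  # allocate on first touch
        offset = (i % self.tile) * self.tile + (j % self.tile)
        return self.tiles[key], offset

    def set(self, i, j, v):
        buf, off = self._locate(i, j)
        buf[off] = v

    def get(self, i, j):
        buf, off = self._locate(i, j)
        return buf[off]

    def drop_tile(self, ti, tj):
        # Deallocate a tile whose live range has ended; its buffer
        # goes back to the pool for later reuse.
        self.pool.release(self.tiles.pop((ti, tj)))
```

For example, with 4×4 tiles, writing two elements that fall in different tiles triggers two allocations; dropping one tile and then touching a third tile reuses the freed buffer instead of allocating again.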