swSpTRSV

Xinliang Wang, Wei Xue, Weifeng Liu, Li Wu
2018 Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '18  
Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and the high cost of global memory or shared cache access in inter-level synchronization have been largely neglected in existing work. In this paper, we propose a novel data layout called Sparse Level Tile to bring all data reuse under control, and design a Producer-Consumer pairing method so that any inter-level synchronization happens only through very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building block of Sunway TaihuLight, currently the world's fastest supercomputer. Experimental results on all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and a best speedup of 38.5 over the parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor on 1856 matrices and the latest methods on a K80 GPU on 1672 matrices.

Unfortunately, in most real-world matrices (as collected in the University of Florida Sparse Matrix Collection [8]), the synchronization cost dominates the overall execution time. This makes the performance of parallel SpTRSV through level-set methods far from satisfactory. Compared to sparse matrix-vector multiplication (SpMV) [31, 33, 58, 64], the SpTRSV kernel has exactly the same calculation cost (in terms of the number of arithmetic and memory access operations) but can be more than a hundred times slower than SpMV on modern processors [23, 25, 29, 30]. Based on a comprehensive study, Li and Saad [25] pointed out that SpTRSV is the actual performance bottleneck of parallel preconditioned iterative solvers, due to the high cost of inter-level synchronization. Even though recent research (e.g., sparsifying synchronization by pruning [40] and replacing synchronization with atomic operations [29]) has improved the level-set method by reducing the amount of synchronization, it has not explored the potential of the memory subsystems of modern processors.
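To make the baseline concrete, the following is a minimal sketch of the generic level-set approach that the text refers to, not the Sparse Level Tile method proposed in the paper. It assumes a lower-triangular matrix stored in CSR form with an explicit nonzero diagonal; the function names and the small 4 × 4 example are illustrative assumptions. Rows within one level are independent, and the point between levels is where a parallel implementation would need the costly inter-level synchronization.

```python
def build_level_sets(row_ptr, col_idx):
    """Group the rows of a lower-triangular CSR matrix into dependency levels."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # x[i] depends on the already-leveled x[j]
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def sptrsv_level_set(row_ptr, col_idx, vals, b):
    """Solve L x = b level by level (sequential stand-in for the parallel loop)."""
    x = [0.0] * len(b)
    for rows in build_level_sets(row_ptr, col_idx):
        # Rows in the same level do not depend on each other, so a parallel
        # implementation could process them concurrently.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
        # A parallel implementation would place a barrier here: this is the
        # inter-level synchronization discussed in the text.
    return x


# Illustrative 4 x 4 lower-triangular matrix (assumed example data):
# [2 0 0 0]
# [0 1 0 0]
# [1 0 4 0]
# [0 2 1 2]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2.0, 1.0, 1.0, 4.0, 2.0, 1.0, 2.0]
b = [2.0, 2.0, 13.0, 15.0]

levels = build_level_sets(row_ptr, col_idx)
x = sptrsv_level_set(row_ptr, col_idx, vals, b)
```

In this example the four rows fall into three levels, so a parallel run would synchronize twice while performing the same arithmetic as an SpMV over the same nonzeros.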
This underexploration is reflected in two aspects: (1) data reuse of x and b relies too heavily on the cache, which is hardware managed and may not provide the best data swapping; and (2) inter-level synchronization, even in the recently proposed methods [29, 40], needs to go through shared global memory, which is too slow compared to inter-core communication. Our proposed method is primarily concerned with parallel SpTRSV that is aware of data locality and fast synchronization. Besides the parallelism, which has already been developed by the level-set methods, we further tap the potential of memory access for higher performance. We propose a new data layout called Sparse Level Tile, or SLT for short, which divides a sparse matrix into two types of 2D tiles with nonuniform shapes. By carefully establishing the connections between these tiles, the SLT layout gives highly efficient data reuse for both the solution x and the right-hand side b, and turns fine-grained, random and unprefetchable memory accesses into coarse-grained, predictable and prefetchable ones. As for fast synchronization, we exploit the inter-core communication of the newly developed SW26010 many-core processor, which is the main building block of Sunway TaihuLight, currently the world's fastest supercomputer (125 Pflops peak performance, 93 Pflops sustained LINPACK performance, composed of 40960 SW26010 processors [1]). The processor offers a register communication scheme that works within the same row or column of its cores in a 2D mesh. This regular communication pattern offers opportunities for fast inter-core communication but is also a challenge for the irregular sparse matrix problems we are facing. Based on the relationship between x_i and b_i, we design a Producer-Consumer pairing method, where the paired x_i and b_i are held in the paired Producer and Consumer respectively.
The paired Producer and Consumer are in the same row which makes any inter-level synchronization only happen through register communication in the same row. Meanwhile, such method
[Figure: SW26010 architecture, showing an 8 × 8 CPE cluster, SPM, and main memory.]
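The row-and-column constraint on register communication described above can be illustrated with a small sketch. It assumes an 8 × 8 core mesh with row-major core numbering; the numbering scheme and the function names are hypothetical and serve only to show the constraint, not the paper's actual pairing algorithm or any SW26010 API.

```python
MESH_DIM = 8  # assumed 8 x 8 mesh of cores


def core_coords(core_id):
    """Map a core id to (row, column), assuming row-major numbering."""
    return divmod(core_id, MESH_DIM)


def can_register_communicate(a, b):
    """Two distinct cores can exchange data directly only if they share
    a row or a column of the mesh."""
    row_a, col_a = core_coords(a)
    row_b, col_b = core_coords(b)
    return a != b and (row_a == row_b or col_a == col_b)


def is_valid_pair(producer, consumer):
    """A Producer-Consumer pair, as described in the text, sits in one row."""
    return producer != consumer and core_coords(producer)[0] == core_coords(consumer)[0]
```

Under these assumptions, cores 0 and 7 share a row and cores 0 and 56 share a column, so both pairs can communicate directly, whereas cores 0 and 9 cannot; only the first pair would qualify as a same-row Producer-Consumer pair.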
doi:10.1145/3178487.3178513 dblp:conf/ppopp/WangLXW18 fatcat:tgsa7oatxva5hlpu3mlbu5f7oe