A Compressed Diagonals Remapping Technique for Dynamic Data Redistribution on Banded Sparse Matrix [chapter]

Ching-Hsien Hsu, Kun-Ming Yu
2003 Lecture Notes in Computer Science  
In this paper, we present a new method, the Compressed Diagonals Remapping (CDR) technique, which aims at efficient data redistribution on banded sparse matrices. The main idea of the proposed technique is first to compress the source matrix into a Compressed Diagonal Matrix (CDM) form. Based on the compressed diagonal matrix, a one-dimensional local and global index transformation can be carried out to perform data redistribution on the compressed diagonal matrix, which is identical to redistributing the data in the banded sparse matrix. The CDR technique uses an efficient one-dimensional indexing scheme to perform data redistribution on a banded sparse matrix. A significant improvement of this approach is that a processor does not need to determine the complicated sending or receiving data sets for dynamic data redistribution; the indexing cost is thus reduced significantly. The second advantage of the present technique is the achievement of optimal packing/unpacking stages, a consequence of the consecutive attribute of column elements in a compressed diagonal matrix. Another contribution of our method is the ability to handle sparse matrix redistribution under two disjoint processor grids in the source and destination phases. A theoretical model to analyze the performance of the proposed technique is also presented in this paper. To evaluate the performance of our methods, we have implemented the present techniques on an IBM SP2 parallel machine along with the v2m algorithm and a dense redistribution strategy. The experimental results show that our technique provides significant improvement for runtime data redistribution of banded sparse matrices in all test samples.

Keywords: compressed diagonals remapping, data redistribution, banded matrix, sparse matrix, parallel algorithm, runtime support.

statements of arrays that were distributed in arbitrary BLOCK-CYCLIC(c) fashion. They also presented closed-form expressions of communication sets for restricted block sizes. A similar approach that addressed the problems of index set and communication set identification for array statements with BLOCK-CYCLIC(c) distribution was presented in [25]. In [25], the BLOCK-CYCLIC(k) distribution was viewed as a union of k CYCLIC distributions. Since the communication sets for a CYCLIC distribution are easier to determine, the communication sets for a BLOCK-CYCLIC(k) distribution can be generated in terms of unions and intersections of some CYCLIC distributions.
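The decomposition used in [25] can be illustrated with a short sketch (our own, not from the paper): under BLOCK-CYCLIC(k) on P processors, the elements whose global index is congruent to r modulo k form the r-th of k interleaved CYCLIC sub-distributions.

```python
# Sketch: BLOCK-CYCLIC(k) over P processors viewed as a union of
# k CYCLIC distributions, one per offset r within a block.

def block_cyclic_owner(i, k, P):
    """Owner of global element i under a BLOCK-CYCLIC(k) distribution."""
    return (i // k) % P

def cyclic_owner(j, P):
    """Owner of element j under a plain CYCLIC distribution."""
    return j % P

# Element i with i % k == r sits at position j = i // k of the r-th
# CYCLIC sub-distribution, and both mappings give the same owner.
N, k, P = 24, 3, 4
for i in range(N):
    j = i // k
    assert block_cyclic_owner(i, k, P) == cyclic_owner(j, P)
```

Because each sub-distribution is plain CYCLIC, its communication sets have a simple closed form, and the BLOCK-CYCLIC(k) sets follow by unioning them.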
However, in many scientific applications, such as the Fast Fourier Transform, the Alternating Direction Implicit (ADI) method for solving two-dimensional diffusion equations, signal processing, and linear algebra solvers, a distribution that is well suited for one phase may not be good for a subsequent phase in terms of performance. Data redistribution is required for those algorithms at runtime. Since the redistribution is performed at runtime, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of reallocating data among processors. Thus, efficient methods for performing data redistribution are of great importance for the development of distributed memory compilers for those languages. Techniques for dynamic data redistribution of dense arrays are discussed in many studies [3, 6, 11, 12, 14, 15, 18-22, 24, 26]; a detailed exposition of these techniques is given in [11]. These techniques can be classified into three categories according to the type of redistribution problem they solve. First, the general case solutions: methods in this category provide algorithms to perform array redistribution between processor sets that might be disjoint in the source and destination distributions. The PITFALLS [24] and ScaLAPACK [22] methods are two examples; they pay particular attention to the indexing and packing/unpacking issues. Second, the special case solutions: methods in this category assume that the redistribution of an array is carried out over the same source/destination processor set. In general, they provide algorithms to generate the communication sets for some specific type of redistribution, such as BLOCK to CYCLIC redistribution, BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution [26], and BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution [5], where k, r, s, t are positive integers.
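To make the "communication sets" notion concrete, here is a minimal sketch (names and structure are ours, not from any cited method) that enumerates the send sets for the simplest special case above, a BLOCK to CYCLIC redistribution of a 1-D array over the same P processors:

```python
# Hedged sketch: brute-force send sets for BLOCK -> CYCLIC redistribution
# of a 1-D array of length N over the same set of P processors.

def block_owner(i, N, P):
    """Owner of element i under a BLOCK distribution (last block may be short)."""
    b = (N + P - 1) // P          # block size
    return i // b

def send_sets(N, P):
    """sets[src][dst] = global indices that processor src must send to dst."""
    sets = [[[] for _ in range(P)] for _ in range(P)]
    for i in range(N):
        src, dst = block_owner(i, N, P), i % P   # destination is CYCLIC
        if src != dst:
            sets[src][dst].append(i)
    return sets

# e.g. with N=8, P=2: processor 0 holds elements 0..3 under BLOCK and
# must send the odd-indexed ones: send_sets(8, 2)[0][1] == [1, 3]
```

The special-case algorithms cited above replace this O(N) enumeration with closed-form index expressions; the sketch only shows what those expressions must compute.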
The BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution is the most general case in this category. Finally, the communication optimization solutions: methods in this category, in general, provide different approaches to reduce the communication overheads in a redistribution process. Examples are the processor mapping technique [14], the multiphase redistribution technique [15], the communication scheduling approach [6], and the strip mining approach [28]. Methods in this category pay particular attention to the communication issue. Prior works focused on dense arrays and regular distributions. When data redistribution is carried out on sparse matrices, these algorithms become inapplicable, since a large amount of memory and data transmission cost would be wasted. Another difficulty of redistributing data on sparse matrices is that nonzero elements are
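The compressed-diagonal storage that CDR builds on can be sketched as follows (our own illustration under assumed conventions, not the paper's code): a banded n x n matrix with half-bandwidth w is packed into 2w + 1 rows of length n, one row per diagonal, so that redistribution can use a one-dimensional index scheme along each diagonal.

```python
# Minimal sketch of compressed diagonal (CDM-style) storage for a
# banded matrix; row d + w of the result holds the diagonal with
# column - row offset d, for d in -w..w.

def to_cdm(A, w):
    n = len(A)
    cdm = [[0] * n for _ in range(2 * w + 1)]
    for d in range(-w, w + 1):        # diagonal offset: col - row
        for i in range(n):
            j = i + d
            if 0 <= j < n:
                cdm[d + w][i] = A[i][j]
    return cdm

A = [[1, 2, 0],
     [3, 4, 5],
     [0, 6, 7]]
cdm = to_cdm(A, 1)
# cdm[1] is the main diagonal: [1, 4, 7]
```

Only O((2w + 1) * n) cells are stored instead of n * n, which is why a dense redistribution strategy wastes memory and transmission volume on such matrices.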
doi:10.1007/3-540-37619-4_8