An iteration-based hybrid parallel algorithm for tridiagonal systems of equations on multi-core architectures

Guangping Tang, Wangdong Yang, Kenli Li, Yu Ye, Guoqing Xiao, Keqin Li
Concurrency and Computation: Practice and Experience, 2015
An optimized parallel algorithm is proposed to address the complicated backward substitution that arises in cyclic reduction when solving tridiagonal linear systems. Adopting a hybrid parallel model, this algorithm combines the cyclic reduction method and the partition method. Compared with the cyclic reduction method, the hybrid algorithm has a simpler backward substitution on parallel computers. In this paper, the operation count and execution time are obtained
and compared for these methods. On the basis of these measured parameters, the hybrid algorithm, implemented with multi-threading, achieves better efficiency than the other parallel methods, namely the cyclic reduction and partition methods. In particular, the approach presented in this paper has the lowest scalar operation count and the shortest execution time on a multi-core computer once the system size exceeds a certain dimension threshold. The hybrid parallel algorithm improves the performance of the cyclic reduction and partition methods by 19.2% and 13.2%, respectively. In addition, a comparison of the single-iteration and multi-iteration hybrid parallel algorithms shows that increasing the number of cyclic reduction iteration steps has little effect on the performance of the hybrid parallel algorithm.

In the past, several direct and iterative methods have been proposed for this problem. The iterative methods [3, 4] are mainly Jacobi's method, Gauss-Seidel, and successive overrelaxation. The direct methods include the Thomas algorithm [5] and several parallel methods. Although the Thomas algorithm is the fastest algorithm on a serial computer, it is not directly and completely parallelizable because each step depends on the preceding one [6]. The present paper describes several parallel tridiagonal solvers and presents a new hybrid parallel algorithm for tridiagonal systems.

Related research

The KLU (Clark Kent LU) algorithm, one method available in existing numerical libraries, was proposed by Ekanathan Palamadai [7]. KLU is a high-performance sparse linear solver that employs hybrid ordering mechanisms and elegant factorization and solve algorithms. It achieves a high-quality fill-in rate and beats many existing solvers in run time when used for matrices arising in circuit simulation. However, KLU is a serial algorithm. Research into parallelization strategies for tridiagonal solvers continues to be an active area of exploration.

The Thomas algorithm, a specialized form of Gaussian elimination, is one of the first algorithms considered for tridiagonal linear systems. It proceeds in two distinct steps, forward elimination and backward substitution. The first step removes the coefficients below the diagonal, leaving a new system whose coefficient matrix has only two diagonals. The second step is a backward sweep over this new system that computes the solution. Part of the Thomas algorithm can also be parallelized: T.M. Edison and G. Erlebacher [8] transformed the sequential Thomas algorithm into a parallel, chained, load-balanced algorithm.

Lambiotte and Voigt [3] presented the cyclic reduction method on the CDC STAR-100 computer. The cyclic reduction method, also implemented on GPUs [9], is a divide-and-conquer algorithm [10, 11]. In this approach, all the odd-indexed unknowns are eliminated while the even-indexed ones remain, yielding a system half the size of the previous one but with the same tridiagonal structure [12]. The reduction is applied recursively until only a minimal number of equations remain, which can be solved directly. The results of this trivial system are then recursively back-substituted to obtain the full solution of the tridiagonal system [13].
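To make the reduction and back-substitution described above concrete, the following is a minimal serial sketch of cyclic reduction in Python. The function name, the NumPy-based layout, and the restriction to systems of size n = 2^k - 1 are illustrative choices made here, not details taken from the paper; in a parallel implementation, the updates within each level are independent and can be distributed across threads.

import numpy as np

def cyclic_reduction(a, b, c, d):
    # a: sub-diagonal (a[0] = 0), b: main diagonal,
    # c: super-diagonal (c[-1] = 0), d: right-hand side.
    # Illustrative sketch; assumes n = 2**k - 1 so that the forward
    # reduction ends with a single decoupled equation.
    a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
    n = len(b)
    x = np.zeros(n)

    # Forward reduction: each level eliminates every other unknown,
    # halving the system while preserving its tridiagonal structure.
    stride = 1
    while 2 * stride <= n:
        for i in range(2 * stride - 1, n, 2 * stride):
            lo, hi = i - stride, i + stride
            alpha, beta = -a[i] / b[lo], -c[i] / b[hi]
            b[i] += alpha * c[lo] + beta * a[hi]
            d[i] += alpha * d[lo] + beta * d[hi]
            a[i], c[i] = alpha * a[lo], beta * c[hi]
        stride *= 2

    # A single decoupled equation remains at the middle index.
    mid = (n - 1) // 2
    x[mid] = d[mid] / b[mid]

    # Backward substitution: recover the eliminated unknowns level by
    # level; the solves within one level are mutually independent.
    while stride > 1:
        stride //= 2
        for i in range(stride - 1, n, 2 * stride):
            left = x[i - stride] if i >= stride else 0.0
            right = x[i + stride] if i + stride < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
    return x

For example, cyclic_reduction([0, 1, 1], [2, 2, 2], [1, 1, 0], [3, 4, 3]) returns [1, 1, 1].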
Stone first reported the recursive doubling method [14, 15]. In this algorithm, an LU decomposition [16] of the tridiagonal matrix A is carried out so that Ax = LUx = Lz = y, where Ux = z. The algorithm consists of two primary steps: in a forward sweep, the system Lz = y is solved, followed by a backward sweep solving Ux = z [13]. Parallel extensions of the recurrence relations used in the forward and backward sweeps take the recursive doubling form, in which each recurrence is expressed in terms of two functions that are each half as complex [6]. In addition, Zhang et al. [2] reported the parallel cyclic reduction and hybrid algorithms, including hybrid parallel cyclic reduction and hybrid recursive doubling algorithms.

The parallel partition method, another example of a divide-and-conquer algorithm, was first developed by Wang [1]. With this algorithm, the tridiagonal matrix A is first partitioned into a suitable number of sub-matrices, and eliminations are then performed simultaneously on each of the sub-matrices until the coefficient matrix A is diagonalized [10]. Rozloznik et al. [17] presented a symmetric tridiagonalization algorithm that effectively exploits high-performance level-3 Basic Linear Algebra Subprograms routines and can be as efficient as symmetric block-diagonalization. Besides tridiagonal solvers, block tridiagonal solvers based on cyclic reduction and parallel cyclic reduction are particularly amenable to efficient and scalable parallelization [13, 18].

The recursive doubling method was developed for parallel computers and supercomputers consisting of multiple processors, such as the Illiac IV [19]. The cyclic reduction method, the partition method, and other parallel algorithms can also be implemented efficiently on parallel computers. Among these methods, the partition method may be the worst when n is large. With respect to computational complexity, although these three algorithms are O(n), the single-iteration hybrid parallel algorithm can be
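As a concrete illustration of the two-sweep structure on which recursive doubling builds, the following is a minimal serial sketch of a tridiagonal solve via LU factorization: a forward sweep solves Lz = y and a backward sweep solves Ux = z. The function name and NumPy-based layout are illustrative choices, not taken from the paper; recursive doubling parallelizes these first-order recurrences by repeatedly rewriting each term in terms of values twice as far back in the chain, so that the sweeps complete in a logarithmic number of parallel steps.

import numpy as np

def tridiag_lu_solve(a, b, c, y):
    # a: sub-diagonal (a[0] unused), b: main diagonal,
    # c: super-diagonal (c[-1] unused), y: right-hand side.
    n = len(b)
    ell = np.zeros(n)  # sub-diagonal of the unit lower bidiagonal factor L
    u = np.zeros(n)    # diagonal of the upper bidiagonal factor U
    z = np.zeros(n)
    x = np.zeros(n)

    # Factorization A = LU (U keeps A's super-diagonal c unchanged).
    u[0] = b[0]
    for i in range(1, n):
        ell[i] = a[i] / u[i - 1]
        u[i] = b[i] - ell[i] * c[i - 1]

    # Forward sweep: solve L z = y.
    z[0] = y[0]
    for i in range(1, n):
        z[i] = y[i] - ell[i] * z[i - 1]

    # Backward sweep: solve U x = z.
    x[n - 1] = z[n - 1] / u[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (z[i] - c[i] * x[i + 1]) / u[i]
    return x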
doi:10.1002/cpe.3511