A Scalable Parallel Algorithm for Incomplete Factor Preconditioning

David Hysom, Alex Pothen
2001 SIAM Journal on Scientific Computing  
We describe a parallel algorithm for computing incomplete factor (ILU) preconditioners. The algorithm attains a high degree of parallelism through graph partitioning and a two-level ordering strategy. Both the subdomains and the nodes within each subdomain are ordered to preserve concurrency. We show through an algorithmic analysis and through computational results that this algorithm is scalable. Experimental results include timings on three parallel platforms for problems with up to 20 million unknowns running on up to 216 processors. The resulting preconditioned Krylov solvers have the desirable property that the number of iterations required for convergence is insensitive to the number of processors.

Introduction. Incomplete factorization (ILU) preconditioning is currently among the most robust techniques employed to improve the convergence of Krylov space solvers for linear systems of equations. (ILU stands for incomplete LU factorization, where L and U are the lower and upper triangular (incomplete) factors of the coefficient matrix.) However, scalable parallel algorithms for computing ILU preconditioners have not been available despite the fact that they have been used for more than twenty years [12]. We report the design, analysis, implementation, and computational evaluation of a parallel algorithm for computing ILU preconditioners.

Our parallel algorithm assumes that three requirements are satisfied.
• The adjacency graph of the coefficient matrix (or the underlying finite element or finite difference mesh) must have good edge separators, i.e., it must be possible to remove a small set of edges to divide the problem into a collection of subproblems that have roughly equal computational work requirements.
• The size of the problem must be sufficiently large relative to the number of processors so that the work required by the subgraph on each processor is suitably large to dominate the work and communications needed for the boundary nodes.
• The subdomain intersection graph (to be defined later) should have a small chromatic number. This requirement will ensure that the dependencies in factoring the boundary rows do not result in undue losses in concurrency.

An outline of the paper is as follows. In section 2, we describe the steps in the parallel algorithm for computing the ILU preconditioner in detail and provide theoretical justification.
The algorithm is based on an incomplete fill path theorem; the proof and discussion of the theorem are deferred to an appendix. We also discuss *
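
As a reminder of what the introduction means by incomplete L and U factors, the following is a minimal sequential sketch of a zero-fill incomplete factorization, often called ILU(0). It is not the parallel algorithm described in this paper: it assumes a small matrix stored as a dense list of rows, assumes nonzero pivots, and simply discards any fill outside the sparsity pattern of the input. The function name `ilu0` and the storage layout are illustrative assumptions.

```python
def ilu0(a):
    """Zero-fill incomplete LU of a square matrix given as a list of rows.

    Illustrative sketch only: entries are updated solely where `a` is
    already nonzero, so any fill an exact LU would create is dropped.
    Assumes every pivot encountered is nonzero (no pivoting is done).
    """
    n = len(a)
    lu = [list(row) for row in a]  # work on a copy of the input
    pattern = [[a[i][j] != 0 for j in range(n)] for i in range(n)]

    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue  # entry outside the pattern: nothing to eliminate
            lu[i][k] = lu[i][k] / lu[k][k]  # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i][j]:  # update only inside the pattern
                    lu[i][j] -= lu[i][k] * lu[k][j]

    # Split the combined storage into unit lower and upper triangular factors.
    lower = [[lu[i][j] if j < i else (1.0 if j == i else 0.0)
              for j in range(n)] for i in range(n)]
    upper = [[lu[i][j] if j >= i else 0.0 for j in range(n)]
             for i in range(n)]
    return lower, upper
```

For a tridiagonal matrix an exact LU factorization creates no fill, so in that case the product of the two factors returned by this sketch reproduces the input matrix up to rounding. For a matrix whose exact factors would contain fill, the product only approximates the input, which is what makes the factors usable as a preconditioner and not as a direct solver.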
doi:10.1137/s1064827500376193 fatcat:tffhgmf4hnga7euj5yqjxvl4ua