Evaluation of the deflated preconditioned CG method to solve bubbly and porous media flow problems on GPU and CPU
International Journal for Numerical Methods in Fluids
In both bubbly and porous media flow, the jumps in coefficients may yield an ill-conditioned linear system. The solution of this system using an iterative technique like the conjugate gradient (CG) is delayed because of the presence of small eigenvalues in the spectrum of the coefficient matrix. To accelerate the convergence, we use two levels of preconditioning. For the first level, we choose between out-of-the-box incomplete LU decomposition, sparse approximate inverse, and truncated Neumann
... eries-based preconditioner. For the second level, we use deflation. Through our experiments, we show that it is possible to achieve a computationally fast solver on a graphics processing unit. The preconditioners discussed in this work exhibit fine-grained parallelism. We show that the graphics processing unit version of the two-level preconditioned CG can be up to two times faster than a dual quad core CPU implementation. 667 In (2), A is the coefficient matrix, x is the unknown pressure, and b is the right-hand side. The coefficient matrix A is highly ill-conditioned, sparse, and symmetric positive (semi) definite (SPD). The ill-conditioning results from the sharp contrasts in material properties (density and permeability) that are inherent to such problems. Because the matrix A is sparse and large, using iterative methods is preferable over direct methods. The conjugate gradient (CG) method is most suited because of the nature of the coefficient matrix. Through this work, we want to show that for bubbly and porous media flow, we can use deflation with preconditioning to accelerate convergence on the graphics processing units (GPUs). In addition, we also want to provide a comparison with an optimized CPU implementation and show that GPUs can be advantageous. We consider a variety of first-level preconditioning techniques  along with deflation  for a two-level preconditioned conjugate gradient (PCG) implementation. On the one hand, we apply two-level preconditioning to accelerate convergence. On the other hand, we focus on implementing PCG method on the GPU. The GPU devices provide higher performance both in terms of memory bandwidth and in terms of computational power in comparison with traditional CPUs. For efficient utilization of the GPUs, the algorithms need to possess fine-grained parallelism. This is a critical aspect, and for this task, it is important to rethink, tailor, or even devise new preconditioning techniques that suit the GPU architecture and can be computationally efficient on the GPU. In this paper, we use the deflated preconditioned conjugate gradient (DPCG) implemented within the PARALUTION library  . PARALUTION provides various sparse iterative solvers and preconditioners (including DPCG). It includes several preconditioners based on incomplete LU-type decomposition, which we have used in this paper. Deflated PCG method for bubbly flow has been previously studied in    . In these papers, an incomplete Cholesky preconditioner is suggested for the first-level preconditioning and deflation for the second-level preconditioning. The authors point out the methods for effective deflation by studying a variety of deflation vectors. However, in this approach, incomplete Cholesky preconditioning is a bottleneck in a GPU implementation of the DPCG method. An implementation of the DPCG method on the GPU with a preconditioner exhibiting fine-grained parallelism was first reported in  , and better variants of deflation vectors were studied in  . Both these works implemented DPCG on the GPU. For composite materials, DPCG has been previously studied in  . Preconditioners that are suitable for GPU implementation of the PCG method have also been discussed in [3, 11, 12] . The matrix storage formats that we mention in this work are discussed in detail in  along with their GPU implementations. In the next two sections, we describe the preconditioning techniques that we use on both levels. In Section 4, we explain the main building blocks of the preconditioning schemes used on a GPU. In Section 5, we present our numerical experiments and analyze the performance for both the problems that we consider. In the final section, we present our conclusions. FIRST-LEVEL PRECONDITIONING TECHNIQUES In this section, we describe the first-level preconditioning techniques that we use to accelerate convergence of CG. They exhibit fine-grained parallelism and hence can be executed efficiently on a GPU. We begin with preconditioners based on classical incomplete LU-based factorizations followed by approximate inverse-based preconditioners, and finally, we explain the truncated Neumann series (TNS)-based preconditioners. We note that PARALUTION uses ILU factorizations even for the symmetric problems because of the following: The memory footprint is the same as we need to store (in all cases) the U part (in Cholesky U D L T ). This is because calculating transposes for different storage formats is computationally expensive. This is a design choice in PARALUTION and is true for all situations where the transpose is needed. 669 till we come to the last step when we have to provide a solution. The last block of the preconditioner can be solved with Jacobi, symmetric Gauss-Seidel (SGS), or ILU(p; q) preconditioner. Therefore, we have three flavors of multi-elimination LU-based method, which in our results are abbreviated as ME-ILU-J, ME-ILU-SGS, and ME-ILU-ILU.0; 1/. Multi-colored symmetric Gauss-Seidel We parallelize the SGS preconditioner using a multi-coloring decomposition of the original matrix A. The multi-coloring exposes additional parallelism in the forward and backward substitutions. This gives us a block structure in the preconditioner M WD .D C L/D 1 .D C L T /, where L is the strict lower triangular part of A, the coefficient matrix, and D is the diagonal of A. For details, see  . In our results, this preconditioner is abbreviated as MCSGS. Truncated Neumann series-based preconditioning We can define another factorization-based preconditioner continuing on the idea of the decomposition used for the SGS-type preconditioner. In this case, we also approximate the inverse of the factors. The preconditioner can be written as Layered problem. The number of unknowns is 63 64 64 15.