### An in-depth introduction of multi-workgroup tiling for improving the locality of explicit one-step methods for ODE systems with limited access distance on GPUs

Matthias Korch, Tim Werner
2020 Concurrency and Computation
This article considers a locality optimization technique for the parallel solution of a special class of large systems of ordinary differential equations (ODEs) by explicit one-step methods on GPUs. This technique is based on tiling across the stages of the one-step method and is enabled by the special structure of the class of ODE systems considered, that is, the limited access distance. The focus of this article is on increasing the range of access distances for which the tiling technique can
more » ... iling technique can provide a speedup by joining the memory resources and the computational power of multiple workgroups for the computation of one tile (multi-workgroup tiling). In particular, this article provides an extended in-depth introduction and discussion of the multi-workgroup tiling technique and its theoretical and technical foundations together with a new tuning option (mapping stride) and new experiments. The experiments performed show speedups of the multi-workgroup tiling technique compared with traditional single-workgroup tiling for two different Runge-Kutta methods on NVIDIAs Kepler and Volta architectures. K E Y W O R D S GPUs, hexagonal tiling, limited access distance, multi-workgroup tiling, ODE methods, RK methods, trapezoidal tiling INTRODUCTION Many scientific simulations use systems of differential equations as mathematical models to approximate phenomena of the real world. This article considers initial value problems (IVPs) of systems of ordinary differential equations (ODEs): 1 y ′ (t) = f(t, y(t)), y(t 0 ) = y 0 , t ∈ [t 0 , t e ]. (1) The classical numerical solution approach, which is also used by the methods considered in this article, applies a time-stepping procedure that starts the simulation at time t 0 with initial state y 0 and performs a series of time steps t → t +1 for = 0, 1, 2, ... until the final simulation time t e is reached. At each time step , by applying the right-hand-side (RHS) function f ∶ R × R n → R n , a new simulation state y +1 is computed which approximates the exact solution function y(t) at time t +1 . We consider the parallel solution of IVPs by explicit one-step methods on GPUs, which-in contrast to implicit and multi-step methods-only use one input approximation y and a number of evaluations of the RHS function to compute the output approximation y +1 . Since for many IVPs ODE methods are memory bound, optimizing the locality of memory references is important. To reduce the time-to-solution on GPUs, we use the This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.