Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs

Martin Kong, Antoniu Pop, Louis-Noël Pouchet, R. Govindarajan, Albert Cohen, P. Sadayappan
2015 ACM Transactions on Architecture and Code Optimization (TACO)  
Task-parallel languages are increasingly popular. Many of them provide expressive mechanisms for inter-task synchronization. For example, OpenMP 4.0 will integrate data-driven execution semantics derived from the StarSs research language. Compared to the more restrictive data-parallel and fork-join concurrency models, the advanced features being introduced into task-parallel models in turn enable improved scalability through load balancing, memory latency hiding, mitigation of the pressure on
memory bandwidth, and, as a side effect, reduced power consumption.

In this paper, we develop a systematic approach to compile loop nests into concurrent, dynamically constructed graphs of dependent tasks. We propose a simple and effective heuristic that selects the most profitable parallelization idiom for every dependence type and communication pattern. This heuristic enables the extraction of inter-band parallelism (cross-barrier parallelism) in numerical computations ranging from linear algebra to structured grids and image processing. The proposed static analysis and code generation alleviate the burden of a full-blown dependence resolver that tracks the readiness of tasks at run time. We evaluate our approach and algorithms in the PPCG compiler, targeting OpenStream, a representative data-flow task-parallel language with explicit inter-task dependences and a lightweight runtime. Experimental results demonstrate the effectiveness of the approach.

… the ready tasks to worker threads [Planas et al. 2009a; Budimlic et al. 2010; Bosilca et al. 2012; Pop and Cohen 2013]. In particular, runtimes that follow the data-flow model of execution and point-to-point synchronization avoid the drawbacks of barrier-based parallelization patterns: tasks can execute as soon as their data becomes available (i.e., when dependences are satisfied), and lightweight scheduling heuristics exist to improve the locality of this data in the higher levels of the memory hierarchy; no global consensus is required, and relaxed memory consistency can be leveraged to avoid spurious communications; loop skewing is not always required, and wavefronts can be built dynamically without the need for an outer serial loop.
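For reference, the barrier-based wavefront pattern that this remark alludes to can be sketched as follows. This is a minimal illustrative example, not code from the paper: the tile-grid sizes NT_I and NT_J and the tile body process_tile are hypothetical, chosen so that tile (ii, jj) depends on tiles (ii-1, jj) and (ii, jj-1).

#define NT_I 64                      /* hypothetical number of tile rows    */
#define NT_J 64                      /* hypothetical number of tile columns */

static double grid[NT_I][NT_J];

/* Hypothetical tile body: tile (ii, jj) reads tiles (ii-1, jj) and (ii, jj-1). */
static void process_tile(int ii, int jj)
{
    double up   = (ii > 0) ? grid[ii - 1][jj] : 0.0;
    double left = (jj > 0) ? grid[ii][jj - 1] : 0.0;
    grid[ii][jj] = 0.5 * (up + left) + 1.0;
}

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Barrier-based wavefront: the outer loop over anti-diagonals t = ii + jj is
 * serial, tiles on one diagonal run in parallel, and the implicit barrier at
 * the end of each "parallel for" separates consecutive diagonals, idling
 * threads on short diagonals. With point-to-point dependences, each tile
 * instead becomes a task that fires once its two producer tiles complete. */
void wavefront_with_barriers(void)
{
    for (int t = 0; t <= NT_I + NT_J - 2; t++) {
        #pragma omp parallel for
        for (int ii = imax(0, t - NT_J + 1); ii <= imin(t, NT_I - 1); ii++) {
            int jj = t - ii;
            process_tile(ii, jj);
        }
    }
}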
Loop transformations for the automatic extraction of data parallelism have flourished. Unfortunately, the landscape is much less explored in the area of task-parallelism extraction, and in particular the mapping of tiled iteration domains to dependent tasks. This paper makes three key contributions:

Algorithmic. We design a task-parallelization scheme following a simple but effective heuristic to select the most profitable synchronization idiom to use. This scheme exposes concurrency and favors temporal reuse across distinct loop nests (a.k.a. dynamic fusion), and further partitions the iteration domain according to the input/output signatures of dependences. Thanks to this compile-time classification, much of the runtime effort to identify dependent tasks is eliminated, allowing for a very lightweight and scalable task-parallel runtime.

Compiler construction. We implement the above algorithm in a state-of-the-art framework for affine scheduling and polyhedral code generation, targeting the OpenStream research language [Pop and Cohen 2013]. Unlike the majority of task-parallel languages, OpenStream captures point-to-point dependences between tasks explicitly, reducing the work delegated to the runtime by making it independent of the number of waiting tasks.

Experimental. We demonstrate strong performance benefits of task-level automatic parallelization over state-of-the-art data-parallelizing compilers. These benefits derive from the elimination of synchronization barriers and from a better exploitation of temporal locality across tiles. We further characterize these benefits within and across tiled loop nests.

We illustrate these concepts on a motivating example in Sec. 2 and introduce background material in Sec. 3. We present our technique in detail in Sec. 4, evaluate the combined compiler and runtime task parallelization to demonstrate the performance benefits over a data-parallel execution in Sec. 5, and discuss related work in Sec. 6.

The listing below shows OpenStream code for a tiled matrix-product update (C is first scaled by beta, then accumulates alpha * A * B): a stream indexed by the tile row ii carries a point-to-point dependence from each band of the first nest to the matching band of the second nest, so a band of the accumulation nest may start as soon as the corresponding band of the scaling nest has completed.

long band_stream_ii_size = (floor((15 + ni) / 16));
int band_stream_ii[band_stream_ii_size] __attribute__((stream)); /* one stream per tile row ii */
int read_window[W];
int write_window[W];

/* Producer tasks: each task scales one band (tile row ii) of C by beta and
 * signals completion of that band on its stream. */
for (int ii = 0; ii <= floord(ni - 1, 16); ii += 1)
#pragma omp task output(band_stream_ii[ii] << write_window[W])
  for (int jj = 0; jj <= floord(nj - 1, 16); jj += 1)
    for (int i = 16 * ii; i <= min(ni - 1, 16 * ii + 15); i += 1)
      for (int j = 16 * jj; j <= min(nj - 1, 16 * jj + 15); j += 1)
        C[i][j] *= beta;

/* Consumer tasks: each task waits only for the matching band of the first
 * nest before accumulating alpha * A * B into that band of C. */
for (int ii = 0; ii <= floord(ni - 1, 16); ii += 1)
#pragma omp task input(band_stream_ii[ii] >> read_window[W])
  for (int jj = 0; jj <= floord(nj - 1, 16); jj += 1)
    for (int kk = 0; kk <= floord(nk - 1, 16); kk += 1)
      for (int i = 16 * ii; i <= min(ni - 1, 16 * ii + 15); i += 1)
        for (int j = 16 * jj; j <= min(nj - 1, 16 * jj + 15); j += 1)
          for (int k = 16 * kk; k <= min(nk - 1, 16 * kk + 15); k += 1)
            C[i][j] += (alpha * A[i][k]) * B[k][j];
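For contrast, a barrier-synchronized, data-parallel version of the same computation (a hand-written sketch, not output of the paper's toolchain, reusing the floord and min helpers of the generated code above) parallelizes each tiled nest separately; the implicit barrier at the end of the first parallel loop prevents any band of the accumulation nest from starting before every band of the scaling nest has finished:

#pragma omp parallel for
for (int ii = 0; ii <= floord(ni - 1, 16); ii += 1)
  for (int jj = 0; jj <= floord(nj - 1, 16); jj += 1)
    for (int i = 16 * ii; i <= min(ni - 1, 16 * ii + 15); i += 1)
      for (int j = 16 * jj; j <= min(nj - 1, 16 * jj + 15); j += 1)
        C[i][j] *= beta;
/* Implicit barrier here: all threads wait, and cross-nest temporal reuse of
 * the C tiles just written is lost. */
#pragma omp parallel for
for (int ii = 0; ii <= floord(ni - 1, 16); ii += 1)
  for (int jj = 0; jj <= floord(nj - 1, 16); jj += 1)
    for (int kk = 0; kk <= floord(nk - 1, 16); kk += 1)
      for (int i = 16 * ii; i <= min(ni - 1, 16 * ii + 15); i += 1)
        for (int j = 16 * jj; j <= min(nj - 1, 16 * jj + 15); j += 1)
          for (int k = 16 * kk; k <= min(nk - 1, 16 * kk + 15); k += 1)
            C[i][j] += (alpha * A[i][k]) * B[k][j];

Eliminating this barrier, so that each band of the second nest becomes ready as soon as its matching band of the first nest completes, is the cross-nest temporal reuse (dynamic fusion) that the stream-based version expresses.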
doi:10.1145/2687652