SUPPLE: An efficient run-time support for non-uniform parallel loops

Salvatore Orlando, Raffaele Perego
Journal of Systems Architecture, 1999
This paper presents SUPPLE (SUPport for Parallel Loop Execution), an innovative run-time support for the execution of parallel loops with regular stencil data references and non-uniform iteration costs. SUPPLE relies upon a static block data distribution to exploit locality, and combines static and dynamic policies for scheduling non-uniform iterations. It adopts, as far as possible, a static scheduling policy derived from the owner computes rule, and moves data and iterations among processors
only if a load imbalance actually occurs. SUPPLE always tries to overlap communications with useful computations by reordering loop iterations and prefetching remote ones in the case of workload imbalance. The SUPPLE approach has been validated by many experimental results obtained by running a multi-dimensional flame simulation kernel on a 64-node Cray T3D. We have fed the benchmark code with several synthetic input data sets built on the basis of a load imbalance model, and we have compared our results with those obtained with a CRAFT Fortran implementation of the benchmark.

Parallelism can be expressed by using collective operations on arrays [22], e.g. collective Fortran 90 operators, or by means of parallel loop constructs, i.e. loops in which iterations are declared as independent and can be executed in parallel. In this paper we are interested in run-time supports for parallel loops with regular stencil data references. In particular, we consider non-uniform parallel loops, i.e. parallel loops in which the execution time of each iteration varies considerably and cannot be predicted statically. The typical HPF run-time support for parallel loops exploits a static data layout of arrays onto the network of processing nodes, and a static scheduling of iterations which depends on the specific data layout. Arrays are distributed according to the directives supplied by programmers, and computations, i.e. the various loop iterations, are assigned to processors following a given rule which depends on the data layout (e.g. the owner computes rule). A BLOCK distribution is usually adopted to exploit data locality: computations mapped on a given processor by the compiler mainly use the data block allocated in the corresponding local memory. Conversely, a CYCLIC distribution is usually adopted when load balancing issues are more important than locality exploitation.
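The BLOCK and CYCLIC layouts determine, through the owner computes rule, which processor executes each iteration. The following minimal sketch shows the two index-to-processor mappings for a one-dimensional array of N elements over P processors; the helper names `owner_block` and `owner_cyclic` are illustrative, not part of HPF or SUPPLE:

```c
/* Sketch of HPF-style 1-D layouts: which processor owns element i?
 * Under the owner computes rule, iteration i is then scheduled on owner(i). */

/* BLOCK: contiguous chunks of ceil(N/P) elements per processor. */
int owner_block(int i, int N, int P) {
    int b = (N + P - 1) / P;   /* block size; the last block may be smaller */
    return i / b;
}

/* CYCLIC: elements dealt out round-robin, one at a time. */
int owner_cyclic(int i, int P) {
    return i % P;
}
```

With BLOCK, neighbouring elements (and hence stencil references) usually live on the same processor; with CYCLIC, consecutive iterations land on different processors, which spreads non-uniform costs more evenly.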
A combination of the two distributions, where smaller array blocks are scattered across the processing nodes, can be adopted to find a tradeoff between locality exploitation and load balancing. It is worth noting, however, that the choice of the best distribution is still up to programmers, and that the right choice depends on the features of the particular application. The general problem of finding an optimal data layout is in fact NP-complete [9].

The adoption of a static policy to map data and computations reduces run-time overheads, because all mapping and scheduling decisions are taken at compile-time. While it produces very efficient implementations for regular concurrent problems, the code produced for irregular problems (i.e. problems where some features cannot be predicted until run-time) may be characterized by poor performance.

Much research has been conducted on run-time supports and compilation methods to efficiently implement irregular concurrent problems, and this research is at the basis of the new proposal of the HPF Forum for HPF2 [5]. The techniques proposed are mainly based on run-time codes which collect information during the first phases of a computation, and then use this information to optimize the execution of the following phases of the same computation. An example of these techniques is the one adopted by the CHAOS support [20] to implement non-uniform parallel loops. The idea behind this feature of the CHAOS library is the run-time redistribution of arrays, and the consequent re-mapping of iterations. Redistribution is carried out synchronously between subsequent executions of parallel loops, and is decided on the basis of information (mainly, timing information) collected at run-time during a previous execution of the loop. If the original load distribution is not uniform, the data layout is thus modified in order to balance the processor loads.
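The redistribution idea described above can be illustrated with a toy re-partitioner: given per-iteration costs measured during a previous execution of the loop, cut the iteration space into P contiguous chunks of roughly equal total cost. This is only a sketch of the general technique, not the CHAOS API; the function name `rebalance` and its interface are assumptions:

```c
/* Re-partition N iterations into P contiguous chunks whose measured costs
 * are each close to total/P. On return, cut[p] holds the first iteration
 * assigned to processor p (so p owns iterations [cut[p], cut[p+1])). */
void rebalance(const double *cost, int N, int P, int *cut) {
    double total = 0.0;
    for (int i = 0; i < N; i++) total += cost[i];

    double target = total / P;   /* ideal per-processor load */
    double acc = 0.0;
    int p = 0;
    cut[0] = 0;
    for (int i = 0; i < N && p < P - 1; i++) {
        acc += cost[i];
        if (acc >= target * (p + 1))   /* this chunk has reached its share */
            cut[++p] = i + 1;
    }
    while (++p < P) cut[p] = N;  /* degenerate case: fewer cuts than procs */
}
```

For uniform costs this reproduces a plain BLOCK layout; when a few iterations dominate, the chunks containing them shrink so that each processor's share of the measured load stays near the average.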
In [18] a dialect of HPF has been extended with new language constructs which are interfaced with the CHAOS library to support irregular computations. In this paper we address non-uniform parallel loop implementations, introducing a truly innovative support called SUPPLE (SUPport for Parallel Loop Execution). SUPPLE is not a general support on which we can compile every HPF parallel loop, but only non-uniform (as well as uniform) parallel loops with regular stencil data references. Since stencil references are regular and known at compile time, optimizations such as message vectorization, coalescing and aggregation, as well as iteration reordering, can be carried out to reduce overheads and hide communication latencies [7].
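The iteration-reordering optimization mentioned above can be sketched as follows: with a stencil halo of width h, the boundary iterations (whose results a neighbouring processor needs) are executed first, so that their communication can be overlapped with the remaining interior iterations. The function name `reorder` and its interface are illustrative, not SUPPLE's actual API:

```c
/* Build an execution order for local iterations [lo, hi) of a 1-D stencil
 * loop with halo width h: edges first (so sends can be posted early),
 * interior last (overlapped with communication). Returns the count written
 * to order[], which must hold at least hi - lo entries. Assumes
 * hi - lo >= 2 * h so the edge regions do not overlap. */
int reorder(int lo, int hi, int h, int *order) {
    int n = 0;
    for (int i = lo; i < lo + h; i++)     order[n++] = i;  /* left edge   */
    for (int i = hi - h; i < hi; i++)     order[n++] = i;  /* right edge  */
    for (int i = lo + h; i < hi - h; i++) order[n++] = i;  /* interior    */
    return n;
}
```

In an actual message-passing implementation, the runtime would execute the edge iterations, post non-blocking sends of the halo results, run the interior iterations, and only then wait for the incoming halos, hiding the communication latency behind the interior work.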
doi:10.1016/s1383-7621(98)00071-x