Advanced optimization strategies in the Rice dHPF compiler
Concurrency and Computation
High Performance Fortran (HPF) was envisioned as a vehicle for modernizing legacy Fortran codes to achieve scalable parallel performance. To a large extent, today's commercially available HPF compilers have failed to deliver scalable parallel performance for a broad spectrum of applications because of insufficiently powerful compiler analysis and optimization. Substantial restructuring and hand-optimization can be required to achieve acceptable performance with an HPF port of an existing Fortran
application, even for regular data-parallel applications. A key goal of the Rice dHPF compiler project has been to develop optimization techniques that enable a wide range of existing scientific applications to be ported easily to efficient HPF with minimal restructuring. This paper describes the challenges to effective parallelization presented by complex (but regular) data-parallel applications, and then describes how the novel analysis and optimization technologies in the dHPF compiler address these challenges effectively, without major rewriting of the applications. We illustrate the techniques by describing their use in parallelizing the NAS SP and BT benchmarks. The dHPF compiler generates multipartitioned parallelizations of these codes that approach the scalability and efficiency of sophisticated hand-coded parallelizations.

For loop nests with complex data dependences, such as the example of Figure 4, we have developed an algorithm to eliminate inner-loop communication without excessive loss of cache reuse. In addition, the compiler applies several communication and memory optimizations:

• The compiler vectorizes communication for arbitrary regular communication patterns. Communication is vectorized out of any loop as long as doing so will not cause any loop-carried or loop-independent data dependence to be violated.

• The compiler can coalesce messages for arbitrary affine references to a data array. Any two communication events at a point in a program that are derived from different references to the same array will be coalesced if the data sets for the references overlap and both communication events involve the same communication partners. This optimization significantly reduces communication frequency.

• The compiler further reduces message frequency by aggregating communication events for affine references to disjoint sections of an array, or to different arrays, if both communication events occur at the same place in the code and involve the same communication partners.
• The compiler-generated code and supporting runtime library use asynchronous communication primitives for latency and asynchrony tolerance.

• The compiler generates code that implements a simple array-padding scheme that eliminates most intra-array conflict misses, thereby improving cache performance.

Figure 10. dHPF-generated NAS BT using 3D multipartitioning.

In BT's compute_rhs routine, partially replicating the computation of the privatizable temporary arrays rho q, qs, us, vs, ws, and square along the boundaries of a multipartitioned tile avoided communication of these six variables. No additional communication was needed to partially replicate this computation, because the boundary planes of the multipartitioned u array needed by the replicated computation were already being communicated in this routine. (The redundant communication is eliminated by an additional optimization, communication coalescing.) Together, these optimizations cut the communication volume of compute_rhs by nearly half. In BT's lhsx, lhsy, and lhsz subroutines, partially replicating computation along the partitioning boundaries of two arrays, fjac and njac, whose global dimensions are (5, 5, IMAX, JMAX, KMAX), reduced communication by a factor of five. Rather than communicating planes of computed values for these arrays across partitions in the i, j, and k dimensions, we communicated sections of rhs(5, IMAX, JMAX, KMAX), which is a factor of five smaller, to replicate the computation along the partitioning boundaries. The combined impact of these optimizations is that, for a 16-processor class A execution, dHPF had only 1.5% higher communication volume and 20% higher message frequency than the hand-coded implementation. The number and frequency of the MPI messages generated by the compiler-generated BT code are very close to the corresponding pattern of MPI messages in the hand-coded version.
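The communication vectorization described above can be sketched schematically in Python (this is an illustrative model, not dHPF-generated code; the message log, loop bounds, and data values are invented for the example):

```python
# Schematic illustration of communication vectorization: instead of
# sending one boundary element per inner-loop iteration, the compiler
# hoists the communication out of the loop and sends the whole section
# as a single message (legal when no loop-carried or loop-independent
# data dependence is violated).

messages = []  # stand-in for the network: each entry is one message


def send(data):
    messages.append(list(data))


N = 8
halo = [i * i for i in range(N)]  # boundary values needed by a neighbor

# Naive placement: one message per iteration -> N small messages.
naive_log = []
for i in range(N):
    naive_log.append([halo[i]])

# Vectorized placement: the same data moves in a single message.
send(halo)

# Same total volume, far fewer messages.
assert sum(len(m) for m in naive_log) == sum(len(m) for m in messages)
assert len(naive_log) == N and len(messages) == 1
```

The point of the transformation is that the per-message startup cost is paid once rather than N times, while the data volume is unchanged.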
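Message coalescing hinges on merging overlapping data sets destined for the same communication partner. A minimal sketch, assuming array sections are modeled as half-open index intervals (the `coalesce` helper is hypothetical, not a dHPF routine):

```python
# Sketch of message coalescing for two affine references to the same
# array: if their data sets overlap and the communication partners
# match, the two messages are merged into one covering the union.

def coalesce(sections):
    """Merge overlapping or adjacent [lo, hi) index sections."""
    merged = []
    for lo, hi in sorted(sections):
        if merged and lo <= merged[-1][1]:   # overlaps/abuts last section
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(s) for s in merged]


# e.g. references a(i-1) and a(i+1) over i in [1, 10) need sections
# [0, 9) and [2, 11) of a -- one coalesced message suffices.
print(coalesce([(0, 9), (2, 11)]))   # -> [(0, 11)]
print(coalesce([(0, 4), (6, 8)]))    # disjoint sections stay separate
```

Disjoint sections that survive this merge are the case the aggregation optimization handles: they can still travel in one message if they occur at the same point in the code and share the same partners.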
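The effect of the array-padding scheme can be modeled with a toy direct-mapped cache (the cache geometry and the one-element pad here are illustrative assumptions, not dHPF's actual policy):

```python
# Toy model of intra-array conflict misses and the padding fix: when a
# direct-mapped cache's set count divides the array's leading dimension,
# the elements a(0,j), a(1,j), ... touched by a column walk all map to
# the same cache set and evict each other, so no reuse survives between
# sweeps.  Padding the leading dimension by one element breaks the
# alignment and spreads the column across distinct sets.

CACHE_SETS = 64  # direct-mapped, one word per line, for simplicity


def conflict_misses(lead_dim, rows):
    """Count misses for two sweeps over a(0,j), a(1,j), ..., a(rows-1,j)."""
    cache = {}
    misses = 0
    for _ in range(2):                 # two sweeps: is reuse possible?
        for r in range(rows):
            addr = r * lead_dim        # address of a(r, j) for fixed j
            s = addr % CACHE_SETS
            if cache.get(s) != addr:
                misses += 1
                cache[s] = addr
    return misses


unpadded = conflict_misses(64, 8)   # leading dimension aligned with cache
padded = conflict_misses(65, 8)     # one-element pad: second sweep hits
assert padded < unpadded
```

With the aligned leading dimension every access misses in both sweeps; with the pad, the second sweep hits entirely, halving the miss count in this toy model.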
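The partial-replication idea in the BT discussion can also be illustrated schematically: rather than communicating a derived quantity's boundary plane, each tile recomputes it from u values it already holds (the names follow the BT discussion, but the arithmetic and data are invented for illustration):

```python
# Sketch of partial computation replication: instead of receiving the
# boundary plane of a derived, privatizable quantity (e.g. square) from
# the neighbor that owns it, each process recomputes it from u values
# it already holds, eliminating one message entirely.
# (The squaring formula here is a stand-in, not BT's actual kernel.)

def square_from_u(u):
    """Derived quantity computed pointwise from u."""
    return [x * x for x in u]


owned_u = [1.0, 2.0, 3.0]    # this tile's interior values of u
boundary_u = [4.0]           # neighbor's plane of u, already being
                             # communicated for other references

# Without replication: also communicate square's boundary plane.
# With replication: recompute it locally from the u plane we have,
# so no extra communication is needed.
replicated = square_from_u(owned_u + boundary_u)
assert replicated == [1.0, 4.0, 9.0, 16.0]
```

This mirrors the observation in the text: the replication is free precisely because the needed u boundary planes were already in flight for other references.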
The scalar performance of the two versions is also comparable; hence the small performance differential between the hand-coded version and the dHPF-generated version. Figure 10 shows a 16-processor parallel execution trace for one steady-state iteration of the dHPF-generated code for the class A BT benchmark. The corresponding hand-coded trace is shown in Figure 11. The traces show that the major communication phases are similar and occur in the same order. The principal difference is the longer elapsed time in the hand-coded multipartitioning between sends and their corresponding receives, which was achieved by placing a section of the computation that does not depend on the communication between the send and the receive.
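The hand-coded placement the traces reveal, namely independent work inserted between a send and its matching receive, can be sketched with Python threads standing in for asynchronous message-passing primitives (a simulation of the overlap pattern, not actual MPI code):

```python
# Sketch of latency hiding by code placement: issue an asynchronous
# send, perform computation that does not depend on the incoming
# message, and only then wait on the receive, so communication latency
# overlaps with useful work.  A thread and a queue simulate the
# asynchronous transfer here.

import queue
import threading
import time

channel = queue.Queue()


def isend(data):
    """Asynchronous send: deliver after a simulated network delay."""
    def deliver():
        time.sleep(0.05)          # simulated message latency
        channel.put(data)
    threading.Thread(target=deliver).start()


isend([1, 2, 3])

# Independent computation placed between the send and the receive;
# it needs nothing from the in-flight message.
local = sum(i * i for i in range(1000))

received = channel.get()          # blocks only for any residual latency
assert received == [1, 2, 3]
```

The farther apart the send and the matching receive can legally be placed, the more of the message latency is hidden behind the computation in between, which is exactly the gap the traces expose between the hand-coded and compiler-generated schedules.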