Communication optimizations for fine-grained UPC applications
14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05)
Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applications use bulk memory copies rather than individual reads and writes to the shared space, but finer-grained sharing can be useful for scenarios such as dynamic load balancing, event signaling, and distributed hash tables. In this paper we present three optimization techniques for global address space programs with fine-grained communication: redundancy elimination, use of split-phase communication, and communication coalescing. Parallel UPC programs are analyzed using static single assignment form and a dataflow graph, which are extended to handle the various shared and private pointer types available in UPC. The optimizations also take advantage of UPC's relaxed memory consistency model, which reduces the need for cross-thread analysis. We demonstrate the effectiveness of the analysis and optimizations using several benchmarks, chosen to reflect the kinds of fine-grained, communication-intensive phases that exist in some larger applications. The optimizations show speedups of up to 70% on three parallel systems, representing three different types of cluster network technologies.

• Split-phase communication: On message-passing networks, communication routines are split-phase by nature; an init call initiates the operation, and a subsequent sync call ensures the delivery of data on the remote side. By separating the initiation of a remote memory access as far as possible from its completion, its latency can be hidden through the overlap of communication and computation as well as message pipelining. This capability is especially relevant for UPC, which currently offers no nonblocking communication operations at the language level.