A global communication optimization technique based on data-flow analysis and linear algebra

M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, N. Shenoy
1999 ACM Transactions on Programming Languages and Systems  
Reducing communication overhead is extremely important in distributed-memory message-passing architectures. In this article, we present a technique to improve communication that considers data access patterns of the entire program. Our approach is based on a combination of traditional data-flow analysis and a linear algebra framework, and it works on structured programs with conditional statements and nested loops but without arbitrary goto statements. The distinctive features of the solution are the accuracy in keeping communication set information, support for general alignments and distributions including block-cyclic distributions, and the ability to simulate some of the previous approaches with suitable modifications. We also show how optimizations such as message vectorization, message coalescing, and redundancy elimination are supported by our framework. Experimental results on several benchmarks show that our technique is effective in reducing the number of messages (an average of 32% reduction), the volume of the data communicated (an average of 37% reduction), and the execution time (an average of 26% reduction).
Most of these approaches use a variant of Regular Section Descriptors (RSD) introduced by Callahan and Kennedy [1998]. The two most notable representations are the Available Section Descriptor (ASD) [Gupta et al. 1995b] and the Section Communication Descriptor (SCD) [Yuan et al. 1997a; 1997b]. Associated with each array that is referenced in the program is an RSD that describes the portion of the array being referenced. Although this representation is convenient for simple array sections such as those found in pure block or cyclic distributions, it is hard to embed alignment and general distribution information into it. Apart from inadequate support for block-cyclic distributions, working with section descriptors may sometimes result in overestimation of the communication sets, since regular sections are not closed under the union and difference operators. The resulting inaccuracy may grow linearly with the number of data-flow formulations to be evaluated, thus defeating the purpose of global communication optimization. This problem can be illustrated using the program fragment given in Figure 2(a), assuming that arrays X and Y are distributed blockwise across two processors, 0 and 1. The RSDs corresponding to these two communications are also shown next to the loop statements.
Notice that all communication is from processor 0 to processor 1. The problem here is that a data-flow approach based on RSDs to combine these communications will be unable to represent the combined communication as an RSD. This means that even if all the communication can be hoisted above the i loop, the two communications can only be concatenated, resulting in redundant communication, since these two sets have some common elements. Moreover, since the communication cannot be taken out of the t loop because of a data dependence [Wolf 1996; Zima and Chapman 1991], the redundant communication will occur T times. On the other hand, we represent these sets in our framework as $S_i := \{[d] : \exists(\alpha : d = 1 + 4\alpha \text{ and } 1 \le d \le 197)\}$ and $S_j := \{[d] : \exists(\alpha : 1 + d = 3\alpha \text{ and } 50 \le d \le 299)\}$. Then, by using the Omega library [Kelly et al. 1995], we derive the code shown in Figure 2(b), which can enumerate all the elements in $S_i + S_j$. As a result, each element will be communicated once and only ...
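To make the redundancy argument concrete, the following is a minimal Python sketch that enumerates the two sets S_i and S_j by brute force and compares three ways of combining them: concatenation, the exact union, and the smallest single regular section that covers both. It is not the code of Figure 2(b) and does not use the Omega library. The helper `covering_rsd` and all variable names are hypothetical, and the combined set is assumed to be the set union.

```python
from functools import reduce
from math import gcd

# S_i = { d : exists a such that d = 1 + 4a, 1 <= d <= 197 }
s_i = {d for d in range(1, 198) if (d - 1) % 4 == 0}
# S_j = { d : exists a such that 1 + d = 3a, 50 <= d <= 299 }
s_j = {d for d in range(50, 300) if (d + 1) % 3 == 0}

# Concatenating the two communications sends shared elements twice.
concatenated = len(s_i) + len(s_j)
# The exact union sends every element exactly once.
exact_union = s_i | s_j
# Elements that concatenation would communicate redundantly.
redundant = s_i & s_j


def covering_rsd(elements):
    """Hypothetical helper: smallest (lower, upper, stride) section
    that contains every element of the given set."""
    lo, hi = min(elements), max(elements)
    # The stride must divide every offset from the lower bound.
    stride = reduce(gcd, (d - lo for d in elements), 0)
    return lo, hi, stride or 1


lo, hi, stride = covering_rsd(exact_union)
rsd_size = (hi - lo) // stride + 1

print("concatenated:", concatenated)
print("exact union:", len(exact_union))
print("redundant:", len(redundant))
print("covering section:", (lo, hi, stride), "size", rsd_size)
```

Under these assumptions, the strides 4 and 3 and the differing lower bounds force the covering section to stride 1, so a single section over-approximates the union, while concatenation repeats the elements common to both sets.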
doi:10.1145/330643.330647 fatcat:5pwmebaefrfflmkar3qt6pfdti