Compiling for Scalable Multiprocessors with Polaris

Yunheung Paek, David A. Padua
Parallel Processing Letters, 1997
Due to the complexity of programming scalable multiprocessors with physically distributed memories, it is onerous to manually generate parallel code for these machines. As a consequence, there has been much research on the development of compiler techniques to simplify programming, to increase reliability, and to reduce development costs. For code generation, a compiler applies a number of transformations in areas such as data privatization, data copying and replication, synchronization, and data and work distribution. In this paper, we discuss our recent work on the development and implementation of a few compiler techniques for some of these transformations. We use Polaris, a parallelizing Fortran restructurer developed at Illinois, as the infrastructure to implement our algorithms. The paper includes experimental results obtained by applying our techniques to several benchmark codes.

Research supported in part by Army contract #DABT63-95-C-0097. This work is not necessarily representative of the positions or policies of the Army or the Government.

[...] memory machines have noncoherent caches [3, 16]. This multiplies the difficulty of programming these machines because, to compensate for the lack of hardware cache coherence mechanisms, explicit control of cacheability and coherency is critical for fast execution. This programming work can be done manually; however, automatic techniques not only facilitate program development, but also allow the source program to be more readable without sacrificing efficiency. Additionally, more readable programs are easier to debug and maintain. In this paper, we discuss our work at Illinois on the development of compiler techniques for scalable shared memory multiprocessors with noncoherent caches. Specifically, we have developed techniques to translate conventional Fortran programs for efficient parallel execution on the Cray T3D, the only commercial machine of this class available today. In fact, we have found in this work that multiprocessors with noncoherent caches have important advantages over their cache coherent counterparts. For example, non-cache coherent machines are easier to scale and are more economical [3]. To optimize communication costs in multiprocessors, it is often necessary for software to have explicit control over data movement.
On cache coherent machines, controlling data movement can be cumbersome unless the machine includes mechanisms to override the hardware cache controller. Non-cache coherent machines, by contrast, allow the programmer or the compiler to have explicit and direct control over communication through explicit data movement operations. Having explicit communication control results in other advantages [12], such as a substantial reduction in communication costs from prefetching, data pipelining, and aggregation [11]. In this work, we profit from the fact that the Cray T3D supports fast single-sided communication in the form of PUT/GET primitives. The target language of our translator is CRAFT [6] augmented with libraries that provide single-sided communication primitives. Our work extends the parallelization techniques implemented in the Polaris restructurer [4], which was developed by the authors and other re- [...]
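The PUT primitive and the aggregation optimization mentioned above can be sketched as follows. A one-sided PUT deposits data directly into another processor's memory with no action by the receiver; aggregation replaces many small PUTs with one large transfer. The self-contained C sketch below stands in for the real machine: the "remote" memory is just another buffer and the PUT is a `memcpy`, whereas on the T3D this role is played by SHMEM-style put operations. All names here are hypothetical illustrations, not the paper's runtime library.

```c
#include <string.h>

#define CHUNK 64

/* Stand-in for memory that would live on another processing element. */
static double remote_mem[CHUNK];

/* One-sided PUT: the sender moves data into remote memory directly;
 * the receiver takes no action (here simulated with memcpy). */
static void put(double *remote, const double *local, int nelems) {
    memcpy(remote, local, (size_t)nelems * sizeof *local);
}

/* Aggregation: rather than issuing one PUT per element (many small,
 * high-overhead messages), pack the elements into a contiguous
 * buffer and issue a single large PUT. */
static void aggregated_send(const double *src, int n) {
    double buf[CHUNK];
    for (int i = 0; i < n; i++)
        buf[i] = src[i];           /* pack into one buffer */
    put(remote_mem, buf, n);       /* one transfer instead of n */
}
```

The payoff on a machine like the T3D is that per-message startup cost is paid once per aggregated transfer rather than once per element, which is one source of the communication-cost reductions cited above.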
doi:10.1142/s0129626497000413