Data distribution support on distributed shared memory multiprocessors
Cache-coherent multiprocessors with distributed shared memory are becoming increasingly popular for parallel computing. However, obtaining high performance on these machines requires that an application execute with good data locality. In addition to making effective use of caches, it is often necessary to distribute data structures across the local memories of the processing nodes, thereby reducing the latency of cache misses. While processor caches can exploit temporal locality on both local and remote data, many applications, such as those without temporal reuse or with working sets larger than the cache, are unable to benefit from cache locality alone. To obtain high performance on such applications, it is often necessary to distribute the data structures in the program so that the cache misses of each processor are more likely to be satisfied from local rather than remote memory. We have designed a set of abstractions for performing data distribution in the context of explicitly parallel programs and implemented them within the SGI MIPSpro compiler system. Our system incorporates many unique features to enhance both programmability and performance. We address the former by providing a very simple programming model with extensive support for error detection. Regarding performance, we carefully design the user abstractions with the underlying compiler optimizations in mind, we incorporate several optimization techniques to generate efficient code for accessing distributed data, and we provide a tight integration of these techniques with other optimizations within the compiler. Our initial experience suggests that the directives are easy to use and can yield substantial performance gains, in some cases by as much as a factor of 3 over the same codes without distribution.