Global address space, non-uniform bandwidth: a memory system performance characterization of parallel systems

T. Stricker, T. Gross
Proceedings Third International Symposium on High-Performance Computer Architecture  
Many parallel systems offer a simple view of memory: all storage cells are addressed uniformly. Despite this uniform view of memory, the machines differ significantly in their memory system performance (and may offer slightly different consistency models). Cached and local memory accesses are much faster than remote read accesses to data generated by another processor or remote writes to data intentionally pushed to memories close to another processor. The bandwidth from/to cache and local memory can be an order of magnitude (or more) higher than the bandwidth to/from remote memory. The situation is further complicated by the heavy influence of the access pattern (i.e., the spatial locality of reference) on both the local and the remote memory system bandwidth. In these modern machines, a compiler for a parallel system is faced with a number of options to accomplish a data transfer most efficiently. Choosing the best option requires a cost-benefit model, obtained from an empirical evaluation of the memory system performance. We evaluate three DEC Alpha based parallel systems to demonstrate the practicality of this approach. The common DEC Alpha processor architecture facilitates a direct comparison of memory system performance. These systems are the DEC 8400, the Cray T3D, and the Cray T3E. The three systems differ in their clock speed, their scalability, and the amount of coherency they provide. The DEC 8400 is a shared memory, symmetric multiprocessor based on a high speed bus offering sequential consistency; the Cray T3D and T3E are scalable multicomputers based on a scalable 3D torus interconnect and either do not cache remote accesses at all (T3E) or provide only partial memory consistency within a node (T3D), and therefore typically leave consistency to the application or compiler. Our performance characterization shows that although the clock rate of the DEC 8400 is double that of the Cray T3D, the DEC 8400 offers only modest improvements in the performance of remote memory operations over the Cray T3D. The local and remote memory system performance of the Cray T3E
doi:10.1109/hpca.1997.569658 dblp:conf/hpca/StrickerG97 fatcat:cd33bx24afgcbfoo232sqppmhm
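
The abstract's idea of choosing a data-transfer option from an empirically derived cost-benefit model can be illustrated with a minimal sketch. It is not the authors' model: the names `TransferOption`, `transfer_time_us`, `best_option`, and `OPTIONS` are hypothetical, the linear startup-plus-bandwidth cost form is an assumption, and the numbers are placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferOption:
    """One way to move data, with (assumed) empirically measured parameters."""
    name: str
    bandwidth_mb_s: float  # sustained bandwidth for this access pattern
    startup_us: float      # fixed per-transfer overhead


def transfer_time_us(option: TransferOption, nbytes: int) -> float:
    # With 1 MB = 10**6 bytes, bytes / (MB/s) yields microseconds.
    return option.startup_us + nbytes / option.bandwidth_mb_s


def best_option(options, nbytes: int) -> TransferOption:
    # Pick the option with the lowest estimated transfer time.
    return min(options, key=lambda o: transfer_time_us(o, nbytes))


# Placeholder values for illustration only; NOT data from the paper.
OPTIONS = [
    TransferOption("pull_remote_reads", bandwidth_mb_s=20.0, startup_us=1.0),
    TransferOption("push_remote_writes", bandwidth_mb_s=100.0, startup_us=5.0),
]

if __name__ == "__main__":
    for size in (64, 8192):
        print(size, best_option(OPTIONS, size).name)
```

With these placeholder parameters the low-overhead option is preferred for small transfers and the high-bandwidth option for large ones; a real model would substitute bandwidths measured per machine and per access pattern.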