Benchmarking the computation and communication performance of the CM-5
Concurrency Practice and Experience
Thinking Machines' CM-5 machine is a distributed-memory, message-passing computer. In this paper we devise a performance benchmark for the base and vector units and the data communication networks of the CM-5 machine. We model the communication characteristics such as communication latency and bandwidths of point-to-point and global communication primitives. We show, on a simple Gaussian elimination code, that an accurate static performance estimation of parallel algorithms is possible by using
... those basic machine properties connected with computation, vectorization, communication, and synchronization. Furthermore, we describe the embedding of meshes or hypercubes on the CM-5 fat-tree topology and illustrate the performance results of their basic communication primitives.
The CM-5 is a parallel distributed-memory machine that can scale up to 16,384 processing nodes. Each node contains a SPARC microprocessor, a custom network interface, a local memory up to 128 MBytes, and either a memory controller or vector controller units. The processing nodes are connected by three networks: the diagnostics network, which identifies and isolates errors throughout the system; the high-speed data network, which communicates bulk data; and the control network, which is mainly responsible for the operations requiring the participation of all nodes simultaneously, such as broadcasting and synchronization. As data communication between two nodes can be performed by using either the data network or the control network, we restrict our analysis to these two.
In making this study we have two objectives. The first includes evaluating the computation and communication performance of the CM-5 and modeling the system parameters such as computational processing rate, communication start-up time, and the latency and data transfer bandwidth. The fundamental measurement made in our benchmark programs is the elapsed time for completing some specific tasks or for completing a communication operation. All other performance figures are derived from this basic timing measurement. Second, we want to investigate the feasibility and efficiency of embedding other kinds of network topologies into the CM-5 fat-tree topology and to devise a benchmark for the basic communication primitives of those topologies on the CM-5. There is an enormous number of parallel algorithms for different types of network topologies in the literature [8, 17].
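To illustrate how parameters such as start-up time and bandwidth can be derived from elapsed-time measurements alone, the following minimal sketch fits a linear cost model t(n) = t0 + n/B to a set of (message size, elapsed time) pairs by least squares. The linear model, the function names, and the sample values are assumptions made for illustration; they are not measurements or code from this study.

```python
def fit_linear_cost(sizes, times):
    """Least-squares fit of t(n) = startup + n / bandwidth.

    sizes: message sizes in bytes; times: elapsed times in seconds.
    Returns (startup_seconds, bandwidth_bytes_per_second).
    """
    count = len(sizes)
    mean_size = sum(sizes) / count
    mean_time = sum(times) / count
    sxx = sum((n - mean_size) ** 2 for n in sizes)
    sxy = sum((n - mean_size) * (t - mean_time) for n, t in zip(sizes, times))
    seconds_per_byte = sxy / sxx              # slope of the fitted line
    startup = mean_time - seconds_per_byte * mean_size  # intercept
    return startup, 1.0 / seconds_per_byte


def predict_time(startup, bandwidth, nbytes):
    """Static estimate of the time to transfer nbytes under the linear model."""
    return startup + nbytes / bandwidth


# Hypothetical timings generated from assumed parameters
# (100 microseconds start-up, 8 MBytes/s); NOT CM-5 measurements.
sizes = [16, 64, 256, 1024, 4096]
times = [predict_time(100e-6, 8e6, n) for n in sizes]

startup, bandwidth = fit_linear_cost(sizes, times)
estimate = predict_time(startup, bandwidth, 2048)
```

Once the two parameters are fitted, `predict_time` gives a static estimate for any message size, which is the kind of building block a performance prediction for a whole algorithm would combine with a computation rate.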
We address the problem of efficiently embedding meshes and hypercubes into the fat-tree topology, and we present timings for basic mesh and hypercube primitives. Our benchmarking study shows that these embeddings give efficient results and that many algorithms can be transported to the CM-5 with little or no change. The results of our study make it possible to predict the performance of parallel algorithms without actually running them on the CM-5. We present a Gaussian elimination code and give the corresponding real and estimated execution times in order to show the accuracy of the estimated performance figures.
Related Work
There are numerous articles in the literature about benchmarking different aspects of recent parallel architectures or supercomputers [3, 4, 11, 12, 13, 14, 16]. There are also several benchmark suites specially developed to provide a common ground to test the performance of different high-performance computers [1, 2, 10, 15]. Some of them investigate the use of real application programs, while others employ short kernel codes to evaluate the performance, just as we do here.