Synchronization and communication in the T3E multiprocessor
ACM SIGOPS Operating Systems Review
This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization. The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.

Microprocessors often lack sufficiently large physical address spaces for use in large-scale machines. The DEC 21064, for example, implements a 33-bit physical address, while the maximum physical memory in the T3D is over 128 GB. TLB reach is another potential problem. A TLB that is sufficiently large for a powerful workstation may be insufficient for a machine with a thousand processors and a terabyte of physical memory.

Microprocessors are designed to cache data that they reference. While this is usually beneficial, it is sometimes desirable to make non-cached references to memory. When writing to another processor's memory in a message-passing program, for example, it is far better for the data to end up in the recipient processor's memory than in the sending processor's cache!

In general, microprocessors are designed with an emphasis on latency reduction rather than latency toleration. While this is an effective approach for many codes, it is ineffective for scientific codes with poor locality, and it does not support high-bandwidth communication in large-scale multiprocessors.

This paper discusses the Cray T3E multiprocessor, which is based on the DEC Alpha 21164 microprocessor. We describe the "shell" that surrounds the processor to make it fit comfortably into a kiloprocessor machine, and discuss features designed to support highly-parallel, fine-grained programming.
The paper focuses on communication and synchronization, giving little consideration to the processor, network, memory system or I/O system.
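
The address-space and TLB-reach concerns raised above can be checked with a little arithmetic. The following is a minimal sketch, not taken from the paper: the helper names `bits_needed` and `tlb_reach` are hypothetical, "over 128 GB" is rounded down to exactly 128 GiB, and the TLB figures (64 entries, 8 KiB pages) are illustrative assumptions rather than values stated in the text.

```python
GIB = 1 << 30  # bytes in one gibibyte


def bits_needed(num_bytes: int) -> int:
    """Smallest number of address bits that can index num_bytes byte addresses."""
    return max(num_bytes - 1, 0).bit_length()


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory a TLB can map at once: one page per entry."""
    return entries * page_bytes


# A 33-bit physical address reaches 2**33 bytes.
addressable = 1 << 33
print(addressable // GIB)        # GiB reachable with 33 address bits

# Assumption: treat the T3D's "over 128 GB" as exactly 128 GiB.
t3d_memory = 128 * GIB
print(bits_needed(t3d_memory))   # address bits needed to span that memory
print(t3d_memory // addressable) # number of 33-bit address spaces it fills

# Assumption: illustrative TLB of 64 entries mapping 8 KiB pages.
reach = tlb_reach(64, 8 * 1024)
print(reach // 1024)             # KiB mapped by the TLB at any one time
```

Under these assumptions, a 33-bit address covers only a small fraction of the machine's memory, and the TLB maps far less still, which is the gap the "shell" around the processor has to bridge.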