Explicit Communication and Synchronization in SARC

Manolis Katevenis, Vassilis Papaefstathiou, Stamatis Kavadias, Dionisios Pnevmatikatos, Federico Silla, Dimitrios Nikolopoulos
2010 IEEE Micro  
SARC merges cache-controller and network-interface functions through a single hardware primitive: every access checks the tag and state of the addressed line for events that may trigger responses such as coherence actions, RDMA, synchronization, or configurable event notifications. The fully virtualized and protected user-level API is based on specially marked lines in the scratchpad space that behave as command buffers, counters, or queues. The runtime system maps the communication abstractions of the programming model onto data transfers among local memories, using remote writes or read DMA, and onto task synchronization and scheduling, using notifications, counters, and queues. The on-chip network provides efficient communication among these configurable memories, using advanced topologies and routing algorithms and providing for process variability in NoC links. We simulate benchmark kernels on a full-system simulator to compare speedup and network traffic against cache-only systems with directory-based coherence and prefetchers. Explicit communication provides 10 to 40% higher speedup on 64 cores and reduces network traffic by factors of 2 to 4, thus economizing on energy and power; lock and barrier latency is reduced by factors of 3 to 5.

EXPLICIT COMMUNICATION AND NETWORK INTERFACE EVOLUTION

Interprocessor communication (IPC) is the basis of parallel processing. IPC can be implicit, when the addresses supplied by the software identify neither the physical data locations nor the time of movement, or explicit, when software (the application, compiler, or runtime system) can also indicate physical placement or transfers, besides specifying computation on data. The SARC architecture [1] supports both implicit IPC, through cache coherence, for ease of programming, and explicit IPC, through scratchpad memories and remote store instructions or remote DMA operations, to be used by software whenever possible to achieve scalable performance. To hide IPC latency with implicit communication, we need large issue windows in out-of-order-execution processors, sophisticated data prefetchers, or both. Explicit communication can hide IPC latency better in those cases where software knows better than hardware which transfers need to take place and when.
Remote store instructions, to addresses that indicate proximity to the consumer, transfer data at the earliest possible time when the consumer is known at production time; hardware should coalesce writes to adjacent targets into few network packets, and the processor should not wait for the arrival acknowledgments. Remote direct memory access (RDMA) is the other method for explicit communication, used in cases that require either reads (when the consumer is unknown or unavailable at production time) or multi-word writes (to achieve good coalescing). Traditional systems viewed networks as external (slow) devices, provided DMA in the network interface (NI), and interacted with it through (slow) input/output (I/O) operations. This is inappropriate for modern systems that incorporate networks-on-chip (NoC); in them, RDMA must be accessible through a low-latency, virtualized, user-level interface, as opposed to system calls.

Within the SARC architecture's global virtual address space, explicit communication is based on directly addressable scratchpad local memories. Address translation provides processors, accelerators, and tasks with controlled, protected access to selected portions of this global space, including (portions of) local and remote scratchpads. Within this scratchpad space, software can allocate special areas (as many as it wishes) that behave as command buffers, counters, or queues, with event-response capabilities. Command buffers are used to issue remote (write or read) DMA operations; counters and queues are used for synchronization, including RDMA completion detection, notifications, and waiting for events.
Our approach relies on the same basic principle behind cache operation: for each read or write access, check the tag and the state of the addressed line; for certain combinations of state and access type, side-effect actions must be performed: coherence protocol actions, RDMA, synchronization, or event responses and notifications. The contributions of this work are: (i) we architect a network interface that exploits on-chip communication potential with hardware support for synchronization and explicit communication; (ii) we introduce the event-response mechanism applied to cache-line state and tag bits, and use it to unify the cache controller and the network interface; (iii) we offer a brief overview of the contributions of the SARC project in the field of NoC architecture (Section 3); and (iv) we use full-system simulation to show that remote stores and remote DMAs achieve performance speedup, reduce latency, and dramatically reduce network traffic compared to directory-based cache coherence, even with prefetching (Section 4).
doi:10.1109/mm.2010.77