Efficient and Predictable Group Communication for Manycore NoCs
Lecture Notes in Computer Science
Massive manycore embedded processors with network-on-chip (NoC) architectures are becoming common. These architectures provide higher processing capability due to an abundance of cores, and they provide native core-to-core communication that can be exploited via message passing to provide system scalability. Despite these advantages, manycores pose predictability challenges that can affect both performance and real-time capabilities. In this work, we develop efficient and predictable group communication using message passing specifically designed for large core counts in 2D mesh NoC architectures. We have implemented the most commonly used collectives such that they incur low latency and high timing predictability, making them suitable for balanced parallelization of scalable high-performance and embedded/real-time systems alike. Experimental results on a single-die 64-core hardware platform show that our collectives reduce communication times by up to 95% for single-packet messages and by up to 98% for longer messages. Depending on the group primitive, they outperform prior approaches either for all message sizes or only for small ones. In addition, our communication primitives have significantly lower variance than prior approaches, thereby providing more balanced parallel execution progress and better real-time predictability.
... send messages concurrently, yet without contention, to reduce communication latency. This requires neither dynamic computation of a routing schedule nor memoization of large routing tables, and it incurs no scheduling overhead. Our implementation uses message passing over the NoC of the Tilera TilePro64 and the Intel SCC but is generic enough to be adapted to any 2D mesh NoC. Experimental results on the TilePro hardware platform show that our implementation has lower latencies and less timing variability (lower variance) than prior work. We compared the performance of our implementation in micro-benchmarks against OperaMPI, a reference MPI implementation for the Tilera platform. We observe improvements of up to 95% in communication time for single-packet messages, together with significantly higher timing predictability (lower variance), which supports more balanced execution progress for high-performance computing (HPC) and helps meet deadlines in embedded/real-time scenarios. Our port to the Intel SCC achieves similar results relative to the vendor libraries.
Design and Implementation
Our work assumes a generic 2D mesh NoC switching architecture similar to existing fabrics with high core counts [9]. Each core is composed of a compute core, a network switch, and local caches.
NoC Message Layer (NoCMsg): Our implementation provides an MPI-style message passing interface for NoCs. This layer facilitates basic point-to-point communication and supports our group communication. It optionally provides flow control; in our design, we turn off flow control when it is not required by program logic to further improve performance.
Group Communication Primitives: The key ideas behind our design of group communication primitives are to (1) reduce contention in the NoC; (2) exploit pattern-based communication to exchange messages concurrently; (3) reduce the number of messages by aggregation; and (4) leverage hardware features to improve performance. Given these objectives, it is not feasible to simply resort to binomial trees for most collectives, or to other algorithms such as recursive doubling for allreduce, since these algorithms are contention-agnostic and will result in reduced performance over contention-sensitive NoCs.
We implemented the group communication on the Tilera TilePro64 and ported it to the Intel SCC to demonstrate that our implementation is generic and can be extended to any 2D mesh NoC architecture.
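To illustrate why a contention-agnostic algorithm such as a binomial tree can hurt on a mesh NoC, the following minimal sketch counts how many concurrent messages share a directed link in each round. It is not the implementation described in this work. It assumes an 8x8 mesh with row-major core numbering and XY (dimension-ordered) routing; the helper names xy_route, max_link_load, binomial_rounds, and east_shift are hypothetical, and the nearest-neighbor shift only stands in for one possible contention-free concurrent pattern.

```python
from collections import Counter


def xy_route(src, dst, width):
    """Directed links a message crosses under XY routing (assumed model)."""
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    links = []
    while x != dst_x:  # travel along the row first
        next_x = x + (1 if dst_x > x else -1)
        links.append(((x, y), (next_x, y)))
        x = next_x
    while y != dst_y:  # then along the column
        next_y = y + (1 if dst_y > y else -1)
        links.append(((x, y), (x, next_y)))
        y = next_y
    return links


def max_link_load(messages, width):
    """Largest number of concurrent messages that share one directed link."""
    load = Counter()
    for src, dst in messages:
        load.update(xy_route(src, dst, width))
    return max(load.values(), default=0)


def binomial_rounds(num_cores):
    """Rounds of a binomial-tree broadcast rooted at core 0.

    In the round with distance `step`, every core r < step sends to r + step.
    """
    rounds = []
    step = 1
    while step < num_cores:
        rounds.append([(r, r + step) for r in range(step) if r + step < num_cores])
        step *= 2
    return rounds


def east_shift(width, height):
    """Hypothetical pattern: every core sends to its east neighbor at once."""
    return [(y * width + x, y * width + x + 1)
            for y in range(height) for x in range(width - 1)]


if __name__ == "__main__":
    WIDTH = HEIGHT = 8  # assumed 64-core mesh
    for k, msgs in enumerate(binomial_rounds(WIDTH * HEIGHT)):
        print(f"binomial round {k}: {len(msgs)} messages, "
              f"max link load {max_link_load(msgs, WIDTH)}")
    shift = east_shift(WIDTH, HEIGHT)
    print(f"east shift: {len(shift)} messages, "
          f"max link load {max_link_load(shift, WIDTH)}")
```

Under these assumptions, several binomial rounds place up to four messages on the same link, whereas the neighbor shift moves 56 messages concurrently with at most one message per link. This is the kind of pattern-based, contention-free exchange the design objectives above aim for.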