General hardware multicasting for fine-grained message-passing architectures

Matthew Naylor, Simon W. Moore, David Thomas, Jonathan R. Beaumont, Shane Fleming, Mark Vousden, A. Theodore Markettos, Thomas Bytheway, Andrew Brown
2021, 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee ... to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model.

• Each programmable router makes routing decisions using a 1024-entry content-addressable memory (CAM), which is too small to capture large unstructured communication patterns precisely. Instead, packets can be delivered approximately to a superset of the desired destinations, and software at the receivers can decide whether or not to discard them. Naturally, this leads to more traffic on the network than is necessary. It also means that multicast communication requires software assistance, and software disposal of unwanted messages is a significant overhead.
• The hardware does not provide guaranteed delivery. There is no hardware-enforced flow control and packets are dropped when the communication fabric is overwhelmed. Dropped packets are "retained in a buffer for software examination" [2], but software retransmission schemes are complex and will only lead to more bookkeeping on the cores and more traffic on the network.

In this paper, inspired by the SpiNNaker design, we explore new features for hardware multicasting that are precise, reliable, and generally applicable to a range of HPC domains. Our contributions are as follows.

• We describe the drawbacks of implementing one-to-many communication patterns in software, i.e. software multicasting, especially while guaranteeing delivery.
• We present new techniques for local and distributed hardware multicasting, implemented on top of a many-
doi:10.1109/pdp52278.2021.00028 fatcat:sb2fafsaifdvjhqi3pkfjadnae
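
The excerpt above notes that a routing table too small for the communication pattern forces approximate delivery to a superset of destinations, with receivers discarding unwanted packets in software. The following is a minimal sketch of that effect, not the mechanism of any real router: the slot-sharing rule (`key % capacity`), the function names, and the example groups are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def build_table(groups: Dict[int, Set[int]], capacity: int) -> Dict[int, Set[int]]:
    """Build a routing table with at most `capacity` entries.

    Hypothetical merging rule: keys that collide on `key % capacity` share
    one entry, which holds the union of their destination sets.
    """
    table: Dict[int, Set[int]] = {}
    for key, dests in groups.items():
        table.setdefault(key % capacity, set()).update(dests)
    return table


def simulate(groups: Dict[int, Set[int]], capacity: int) -> Tuple[int, int]:
    """Send one packet per multicast key.

    Returns (wanted deliveries, packets discarded by receiver software).
    """
    table = build_table(groups, capacity)
    wanted_count = 0
    discarded_count = 0
    for key, wanted in groups.items():
        # The router delivers to every destination in the shared entry.
        for dest in table[key % capacity]:
            if dest in wanted:
                wanted_count += 1
            else:
                # Receiver-side software filter drops the unwanted packet.
                discarded_count += 1
    return wanted_count, discarded_count


# Hypothetical multicast groups: key -> set of destination ids.
example_groups = {0: {1, 2}, 1: {3}, 2: {2, 4}}

if __name__ == "__main__":
    for capacity in (2, 3):
        print(capacity, simulate(example_groups, capacity))
```

With three groups and only two table entries, two groups are forced to share an entry and some packets reach receivers that must discard them; with three entries, every group gets a precise entry and nothing is discarded.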