NIC-Assisted Cache-Efficient Receive Stack for Message Passing over Ethernet

Brice Goglin
2009 Lecture Notes in Computer Science  
High-speed networking in clusters usually relies on advanced hardware features in the NICs, such as zero-copy capability. Open-MX is a high-performance message passing stack tailored for regular Ethernet hardware without such capabilities. We present the addition of multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are handled on the same core. We then introduce the idea of binding the target end process near its dedicated receive queue. This model leads to a more cache-efficient receive stack for Open-MX. It also shows that very simple and stateless hardware features may have a significant impact on message passing performance over Ethernet. Implementing this model in the NIC firmware reveals that it may not match some manually tuned micro-benchmarks. However, our multiqueue receive stack generally performs better than the original single-queue stack, especially for large communication patterns where multiple processes are involved and manual binding is difficult.

Several research works have been carried out on high-performance message passing over Ethernet as a way to improve overall parallel computing performance without requiring expensive networking hardware. GAMMA [5] and EMP [22] only work on a limited spectrum of hardware since they rely on modified drivers or modified hardware. Our Open-MX [10] stack is another message passing model, implemented on top of the Ethernet software layer of the Linux kernel. It offers high-performance communication over any generic Ethernet hardware using the wire specifications and the application programming interface of Myrinet Express [18]. However, like QMP [3] and PM [23] (or any other software-based message passing), being compatible with legacy Ethernet NICs also means that Open-MX suffers from limited hardware features. For instance, it has to work around the inability to perform zero-copy receives by offloading memory copies onto Intel I/O Acceleration Technology (I/OAT) hardware [11].

We propose to improve the cache efficiency of the receive side of Ethernet-based message passing by extending the hardware IP multiqueue support so that it filters Open-MX packets as well. Such a stateless feature requires very little computing power and software support compared to existing complex and stateful features such as zero-copy or TOE (TCP Offload Engine). Parallelizing the stack is known to be important on modern machines [24]. We look at it in the context of binding the whole packet processing path to a single core, from the bottom interrupt handler up to the application.

This paper is an extended revision of [12] and is organized as follows. Section 2 presents Open-MX, its possible cache-inefficiency problems, related work, and our motivations. Section 3 describes our proposal to combine a multiqueue extension in the Myri-10G firmware with corresponding support in Open-MX so as to build an automatic binding facility for both the receive handler in the driver and the target application. Section 4 presents a performance evaluation showing that our model achieves satisfying performance on micro-benchmarks, reduces the overall cache miss rate, and improves performance for large communication patterns.

BACKGROUND AND MOTIVATIONS

In this section, we briefly describe the Open-MX stack and how the cache is involved on the receive side. We then present previous work on the cache efficiency of high-performance networking stacks. We finally detail our motivation for adding Open-MX-specific support in the NIC, and our objectives for this implementation.

Cache Efficiency Issues in the Open-MX Stack

The Open-MX stack aims at providing high-performance message passing over any generic Ethernet hardware. First, it bypasses the usual TCP/IP stack so as to exploit the networking hardware more directly.
Open-MX mimics MX behavior near 4 kB: this message size is a good compromise between smaller sizes (where the high Ethernet latency matters) and larger messages (where intensive memory copies may limit performance).
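To make the model above concrete, the following sketch illustrates its two halves: a stateless, firmware-style filter that derives a receive queue from a field of the Open-MX packet header, and the host-side counterpart that binds the target process to the core servicing that queue. This is a minimal sketch under stated assumptions: struct omx_pkt_hdr, omx_rx_queue, omx_bind_near_queue, the modulo hash, and the one-queue-per-core mapping are all illustrative and do not reflect the actual Open-MX or Myri-10G firmware interfaces.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

#define NR_RX_QUEUES 8              /* e.g. one receive queue per core */

/* Illustrative stand-in for the Open-MX packet header: only the field
 * the filter needs (the real header carries much more). */
struct omx_pkt_hdr {
    uint8_t dst_endpoint;           /* destination endpoint (process) id */
};

/* Stateless filter as the NIC firmware could compute it: every packet
 * for a given endpoint hashes to the same queue, so it is always
 * processed by the same interrupt handler on the same core. */
static unsigned int omx_rx_queue(const struct omx_pkt_hdr *hdr)
{
    return hdr->dst_endpoint % NR_RX_QUEUES;
}

/* Host-side counterpart: once the stack knows which queue its endpoint
 * maps to, it migrates the process next to that queue so the interrupt
 * handler, the driver bottom half, and the application share caches.
 * Assumes queue i is serviced by core i. */
static int omx_bind_near_queue(unsigned int queue)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(queue, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this process */
}

int main(void)
{
    struct omx_pkt_hdr hdr = { .dst_endpoint = 5 };
    unsigned int q = omx_rx_queue(&hdr);

    if (omx_bind_near_queue(q) != 0)
        perror("sched_setaffinity");
    printf("endpoint %u -> rx queue %u\n", (unsigned) hdr.dst_endpoint, q);
    return 0;
}
```

The key property the sketch captures is statelessness: the NIC keeps no per-connection table and only computes a cheap hash on a header field, which is why such a feature costs so little compared to stateful offloads such as TOE or zero-copy engines.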
doi:10.1007/978-3-642-03869-3_98