Accelerating shared virtual memory via general-purpose network interface support

Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh
2001 ACM Transactions on Computer Systems  
Clusters of symmetric multiprocessors (SMPs) are important platforms for high performance computing. With the success of hardware cache-coherent distributed shared memory (DSM), a lot of effort has also been made to support the coherent shared address space programming model in software on clusters. Much research has been done in fast communication on clusters and in protocols for supporting software shared memory across them. However, the performance of software virtual memory (SVM) is still ... r from that achieved on hardware DSM systems. The goal of this paper is to improve the performance of SVM on system area network clusters by considering communication and protocol layer interactions. We first examine which communication system bottlenecks stand in the way of improving parallel performance of SVM clusters; in particular, which parameters of the communication architecture are most important to improve further relative to processor speed, which ones are already adequate on modern systems for most applications, and how this will change with technology in the future. We find that the most important communication subsystem cost to improve is the overhead of generating and delivering interrupts for asynchronous protocol processing. Then we proceed to show that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support for shared Memory Abstractions), on a cluster of SMPs with a programmable NI. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications. Parts of this work have appeared as conference publications in SC97 [9] and ISCA99 [8].
doi:10.1145/367742.367747 fatcat:fxb2cwkep5h3hh6nhjqfbsr5da