Adaptive two-level thread management for fast MPI execution on shared memory machines

Kai Shen, Hong Tang, Tao Yang
In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (SC '99)
This paper addresses performance portability of MPI code on multiprogrammed shared memory machines. Conventional MPI implementations map each MPI node to an OS process, which suffers severe performance degradation in multiprogrammed environments. Our previous work (TMPI) developed compile/run-time techniques to support threaded MPI execution by mapping each MPI node to a kernel thread. However, kernel threads have a higher context switch cost than user-level threads, which leads to a longer spinning time requirement during MPI synchronization. This paper presents an adaptive two-level thread scheme for MPI that reduces context switch and synchronization cost. The scheme also exposes thread scheduling information at user level, which allows us to design an adaptive event waiting strategy that minimizes CPU spinning and exploits cache affinity. Our experiments show that the MPI system based on the proposed techniques has substantial performance advantages over the previous version of TMPI and the SGI MPI implementation in multiprogrammed environments. The improvement ratio can reach as much as 161% or even more, depending on the degree of multiprogramming.

And the eager probe brings the cost further down. [Table: measured context switch costs — kernel-level context switch; user-level context switch with two stack switches; user-level context switch with eager probe; numeric values lost in extraction.] Although the context switch cost of kernel threads is not so expensive on SGI machines, thread yielding and resumption are cumbersome and fairly slow (e.g., the thread yield function resumes a kernel thread in a non-deterministic manner, and the shortest sleep interval for a nanosleep call on the Power Challenge is comparatively long). As a result, the spin time in the TMPI event waiting function is fairly large. Using user-level threads instead of pure kernel threads significantly reduces the blocking cost; hence the spinning period can be shortened proportionally and CPU waste can be minimized in TMPI-2 compared to TMPI. The above spin-block approach only considers the penalty of blocking in terms of context switch and thread resumption. There is actually more overhead in spinning and blocking: 1) spinning may be useless if the caller thread that executes the wakeup event is not currently scheduled; 2) blocking may suffer a cache refresh penalty due to the context switch. Previous work on scheduler-conscious synchronization [4, 17] has considered using OS scheduling information to guide lock and barrier implementations.
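The spin-block event waiting discussed above can be sketched as follows. This is a minimal illustration, not TMPI's actual interface: the names (mpi_event_t, event_wait, event_signal) and the use of pthread primitives are assumptions for the sketch; the paper's system would spin for a period tuned to its (small) user-level context switch cost before blocking.

```c
/* Sketch of spin-then-block event waiting (illustrative names only). */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    atomic_int      fired;   /* set to 1 by the signaling thread */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} mpi_event_t;

void event_init(mpi_event_t *ev) {
    atomic_init(&ev->fired, 0);
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
}

/* Spin for up to spin_limit iterations, then block.  The spin limit
   would be chosen relative to the cost of a context switch. */
void event_wait(mpi_event_t *ev, long spin_limit) {
    for (long i = 0; i < spin_limit; i++) {
        if (atomic_load(&ev->fired))
            return;                            /* fast path: no block */
    }
    pthread_mutex_lock(&ev->lock);
    while (!atomic_load(&ev->fired))
        pthread_cond_wait(&ev->cond, &ev->lock);  /* slow path: block */
    pthread_mutex_unlock(&ev->lock);
}

void event_signal(mpi_event_t *ev) {
    pthread_mutex_lock(&ev->lock);
    atomic_store(&ev->fired, 1);
    pthread_cond_signal(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}
```

The key trade-off is the spin limit: too short and waiters pay the blocking penalty even when the event arrives almost immediately; too long and CPU cycles are wasted, which is exactly what becomes cheaper to avoid when context switches are user-level.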
There is also work on OS scheduling to exploit cache affinity [30]. We combine these two ideas and extend them for the MPI runtime system. The unique aspect of our situation is that scheduling information is exposed at user level because of our two-level thread management, and the context switch cost in our system is relatively small (only
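The scheduler-conscious idea above, that spinning is useless when the thread which will fire the wakeup event is not currently scheduled, can be sketched as a simple decision function. All names here (uthread_t, choose_spin_limit, the running flag) are hypothetical, assuming the two-level runtime exports a per-thread "currently scheduled" flag; they are not TMPI's actual API.

```c
/* Sketch: use exposed user-level scheduling state to decide whether
   spinning can pay off before blocking.  Illustrative names only. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool running;   /* maintained by the user-level scheduler:
                              true while this MPI thread occupies a
                              kernel thread on some processor */
} uthread_t;

/* If the peer that will signal us is descheduled, spinning cannot
   succeed until the kernel runs it again, so block immediately. */
long choose_spin_limit(uthread_t *peer, long base_limit) {
    return atomic_load(&peer->running) ? base_limit : 0;
}
```

A waiter would call this before entering its spin loop; returning 0 sends it straight to the blocking path, avoiding wasted cycles while also keeping the waiter's cache-warm processor available for runnable threads.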
doi:10.1145/331532.331581 dblp:conf/sc/ShenTY99