Performance Evaluation of OpenMP Applications with Nested Parallelism [chapter]

Yoshizumi Tanaka, Kenjiro Taura, Mitsuhisa Sato, Akinori Yonezawa
2000 Lecture Notes in Computer Science  
Many existing OpenMP systems do not sufficiently implement nested parallelism. This is supposedly because nested parallelism is believed to require a significant implementation effort, incur a large overhead, or lack applications. This paper demonstrates Omni/ST, a simple and efficient implementation of OpenMP nested parallelism using StackThreads/MP, which is a fine-grain thread library. Thanks to StackThreads/MP, OpenMP parallel constructs are simply mapped onto thread creation primitives of ... /MP, yet they are efficiently managed with a fixed number of threads in the underlying thread package (e.g., Pthreads). Experimental results on a Sun Ultra Enterprise 10000 with up to 60 processors show that the overhead imposed by nested parallelism is very small (1-3% in five out of six applications, and 8% for the other), and that there is a significant scalability benefit for applications with nested parallelism.

Introduction

OpenMP is becoming increasingly popular for high performance computing on shared memory machines. Its current specification, however, is restrictive in many ways [16]. Specifically, the current OpenMP specifies that nested parallelism is optional, which means that an implementation can ignore parallel directives encountered during the execution of another parallel directive. Many existing OpenMP systems in fact support no nested parallelism or only a very limited form of it based on load-based inlining [1, 11, 4]. The basic justification will presumably be as follows:

Benefit: Assuming that sufficient parallelism is obtained at the outermost loop, extracting nested parallelism does no good for overall performance.

Cost: Efficient implementation of nested parallelism is difficult or complex, because it needs thread management that can comfortably handle a very large number of threads. Standard thread libraries such as Pthreads [9] or Win32 threads [12] do not meet this criterion; they incur a large overhead for thread creation and do not tolerate a large number of threads.
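To make the notion of a parallel directive encountered inside another parallel directive concrete, the following is a minimal sketch in C. It is not taken from the paper or from Omni/ST: the function name `nested_sum` and the workload are hypothetical, and the call to `omp_set_max_active_levels` is the standard API of later OpenMP versions (3.0 and up), used here as an assumption about how a reader would request two active levels today. An implementation that treats nesting as optional may still run the inner region on a single thread; the result is the same either way, and the code also compiles without OpenMP because the pragmas are then ignored.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical example: sum of i*j over 0 <= i < rows, 0 <= j < cols.
 * The outer loop opens one parallel region; each iteration of it
 * encounters a second, nested parallel region for the inner loop. */
long nested_sum(int rows, int cols)
{
    long total = 0;

#ifdef _OPENMP
    /* Ask the runtime to allow two active levels of parallelism.
     * The runtime is still free to serialize the inner region. */
    omp_set_max_active_levels(2);
#endif

    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < rows; i++) {
        long row_total = 0;

        /* Nested parallel directive: ignored (serialized) by systems
         * that do not support nested parallelism. */
        #pragma omp parallel for reduction(+:row_total)
        for (int j = 0; j < cols; j++) {
            row_total += (long)i * j;
        }

        total += row_total;
    }
    return total;
}
```

With a compiler that supports OpenMP, this would typically be built with a flag such as `-fopenmp` (GCC, Clang). When the outer loop has fewer iterations than there are processors, only the inner region can supply the remaining parallelism, which is the situation in which nested parallelism is expected to help.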
doi:10.1007/3-540-40889-4_8 fatcat:zjiqxa365beqle3iyidvclopki