Scheduling task parallelism on multi-socket multicore systems

Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, Jan F. Prins
2011, Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers - ROSS '11 (ACM Press)
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. We extended the open-source Qthreads threading library to implement our scheduler, accepting OpenMP programs through the ROSE compiler. We also present a comprehensive performance study of diverse OpenMP task parallel benchmarks, comparing seven different task parallel run time scheduler implementations on current generation multi-socket multicore systems: our hierarchical work stealing scheduler, a fully-distributed work stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the original Qthreads fully-distributed scheduler. In addition, we compare our results against OpenMP implementations from Intel and GCC. Hierarchical scheduling in Qthreads is competitive on all benchmarks. On several benchmarks, hierarchical scheduling in Qthreads demonstrates speedup and absolute performance superior to both the Intel and GCC OpenMP run time systems.
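The hierarchical strategy described in the abstract can be illustrated with a small single-threaded sketch. It is not the Qthreads implementation: the class names, the `steal_batch` parameter, the choice to steal the oldest tasks, and the round-robin victim order are all assumptions made for illustration, and the locking a real shared queue would need is omitted.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Task = Callable[[], None]


class SocketQueue:
    """One shared LIFO queue per chip (cores that share a cache). Hypothetical name."""

    def __init__(self) -> None:
        self.tasks: Deque[Task] = deque()

    def push(self, task: Task) -> None:
        self.tasks.append(task)  # newest task goes on top

    def pop(self) -> Optional[Task]:
        # LIFO: run the most recently created task first, which favors
        # cache locality between a parent and its new children.
        return self.tasks.pop() if self.tasks else None


class HierarchicalScheduler:
    """Illustrative model only; a real runtime would guard each queue with a lock."""

    def __init__(self, num_sockets: int, steal_batch: int = 4) -> None:
        self.queues: List[SocketQueue] = [SocketQueue() for _ in range(num_sockets)]
        self.steal_batch = steal_batch  # assumed: enough work for every core on the chip
        self.remote_steals = 0          # counts costly cross-socket steals

    def spawn(self, socket: int, task: Task) -> None:
        self.queues[socket].push(task)

    def _steal_for(self, socket: int) -> bool:
        """One steal operation fetches a batch on behalf of the whole chip."""
        n_sockets = len(self.queues)
        for offset in range(1, n_sockets):
            victim = self.queues[(socket + offset) % n_sockets]
            if victim.tasks:
                count = min(self.steal_batch, len(victim.tasks))
                for _ in range(count):
                    # Assumed policy: take the victim's oldest tasks.
                    self.queues[socket].tasks.append(victim.tasks.popleft())
                self.remote_steals += 1
                return True
        return False

    def next_task(self, socket: int) -> Optional[Task]:
        task = self.queues[socket].pop()
        if task is None and self._steal_for(socket):
            task = self.queues[socket].pop()
        return task
```

In this model, an idle chip pays for one remote steal and then serves several tasks from its own shared queue, which is the effect the abstract attributes to stealing on behalf of all threads that share a cache.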
doi:10.1145/1988796.1988804 fatcat:r7fcxjxulbe7pm66zacsdn2gam