A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2014; you can also visit <a rel="external noopener" href="http://www.cs.unc.edu/~prins/RecentPubs/ross11.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
<i title="ACM Press">
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers - ROSS '11
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. We extended the open-source Qthreads threading library to implement our scheduler, accepting OpenMP programs through the ROSE compiler. We also present a comprehensive performance study of diverse OpenMP task parallel benchmarks, comparing seven different task parallel run time scheduler implementations on current-generation multi-socket multicore systems: our hierarchical work stealing scheduler, a fully distributed work stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the original Qthreads fully distributed scheduler. In addition, we compare our results against OpenMP implementations from Intel and GCC. Hierarchical scheduling in Qthreads is competitive on all benchmarks. On several benchmarks, hierarchical scheduling in Qthreads demonstrates speedup and absolute performance superior to both the Intel and GCC OpenMP run time systems.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/1988796.1988804">doi:10.1145/1988796.1988804</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/r7fcxjxulbe7pm66zacsdn2gam">fatcat:r7fcxjxulbe7pm66zacsdn2gam</a> </span>
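The two-level idea in the abstract (a shared LIFO queue per chip, plus one steal performed on behalf of every core on that chip) can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the Qthreads implementation: the names ChipQueue, steal_for_chip, and next_task are hypothetical, the batch size of one task per core is an assumption, and taking the victim's oldest tasks is likewise an assumption. Real schedulers would also need locking, which is omitted here.

```python
from collections import deque


class ChipQueue:
    """Task queue shared by all cores on one chip (hypothetical name)."""

    def __init__(self, chip_id):
        self.chip_id = chip_id
        self.tasks = deque()
        self.remote_steals = 0  # counts steal operations, not stolen tasks

    def push(self, task):
        # New tasks go on the top of the shared queue.
        self.tasks.append(task)

    def pop_local(self):
        # LIFO: the newest task runs first, favoring cache locality
        # between a parent and its freshly created children.
        return self.tasks.pop() if self.tasks else None


def steal_for_chip(thief, all_chips, batch_size):
    """One remote steal on behalf of the whole chip.

    Assumption: the thief takes up to batch_size of the victim's
    oldest tasks in a single operation.
    """
    for victim in all_chips:
        if victim is thief or not victim.tasks:
            continue
        count = min(batch_size, len(victim.tasks))
        for _ in range(count):
            thief.tasks.append(victim.tasks.popleft())
        thief.remote_steals += 1
        return count
    return 0


def next_task(chip, all_chips, cores_per_chip):
    """Prefer local work; fall back to one batched remote steal."""
    task = chip.pop_local()
    if task is None and steal_for_chip(chip, all_chips, cores_per_chip):
        task = chip.pop_local()
    return task
```

For example, with two chips of four cores each and eight tasks queued on chip 0, four successive calls to next_task for chip 1 trigger a single remote steal; the remaining three calls are served from chip 1's own shared queue.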
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20140730033324/http://www.cs.unc.edu/~prins/RecentPubs/ross11.pdf" title="fulltext PDF download">Web Archive [PDF]</a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/1988796.1988804">acm.org</a>