A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2017; you can also visit <a rel="external noopener" href="http://synergy.cs.vt.edu/pubs/papers/harb-sync-kernel-mascots16.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels
<span title="">2016</span>
<i title="IEEE">
<a target="_blank" rel="noopener" href="https://fatcat.wiki/container/z2ot7tezyrgpxlxgssko2bzyw4" style="color: black;">2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)</a>
</i>
The lack of support for explicit synchronization between the streaming multiprocessors (SMs) of a GPU adversely impacts its ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous research studies, there has been neither a solid quantification of such overhead, nor
<span class="external-identifiers">
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/mascots.2016.58">doi:10.1109/mascots.2016.58</a>
<a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/mascots/HarbF16.html">dblp:conf/mascots/HarbF16</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/s5q4kd2arvcpziwovbr67k4rvi">fatcat:s5q4kd2arvcpziwovbr67k4rvi</a>
</span>
... on when to use each of the different approaches. Therefore, we quantify the synchronization overhead relative to the number of kernel launches and the input data sizes. The quantification, in turn, provides insight as to when to use each of the aforementioned synchronization mechanisms in a target application. Our results show that implicit CPU synchronization has a significant overhead that hurts application performance when using medium to large data sizes with a relatively large number of kernel launches (i.e., ∼1100-5000). Hence, explicit CPU synchronization is recommended for these configurations. In addition, among the three approaches, we conclude that DP is the most efficient with small data sizes (i.e., ≤128k bytes), regardless of the number of kernel launches. DP also implicitly performs inter-block (i.e., global) synchronization with no CPU intervention. Therefore, DP significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces power consumption by ∼8-10%. However, DP-based synchronization involves a trade-off: it is accompanied by a ∼2-5% performance loss.
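The guidance in the abstract can be read as a simple decision rule. The following is a minimal sketch of that reading, not the authors' method: the function name `suggest_sync` and the constants are hypothetical, "128k bytes" is assumed to mean 128 KiB, and any size above that is assumed to count as "medium to large". Configurations the abstract does not cover return an explicit "no guidance stated".

```python
# Hypothetical helper: encodes only the recommendations stated in the abstract.
# Assumption: "128k bytes" is read as 128 KiB (128 * 1024 bytes).
SMALL_DATA_MAX_BYTES = 128 * 1024
# Assumption: the "~1100-5000" launch range is treated as inclusive at both ends.
MANY_LAUNCHES_MIN = 1100
MANY_LAUNCHES_MAX = 5000


def suggest_sync(data_bytes: int, kernel_launches: int) -> str:
    """Return the synchronization mechanism the abstract recommends, if any."""
    # Small inputs: DP is reported as most efficient, regardless of launch count.
    if data_bytes <= SMALL_DATA_MAX_BYTES:
        return "dynamic parallelism"
    # Assumption: anything above the small-data bound is "medium to large".
    # In that regime, with many launches, explicit CPU sync is recommended.
    if MANY_LAUNCHES_MIN <= kernel_launches <= MANY_LAUNCHES_MAX:
        return "explicit CPU synchronization"
    # The abstract makes no recommendation for the remaining configurations.
    return "no guidance stated"


if __name__ == "__main__":
    print(suggest_sync(64 * 1024, 10))
    print(suggest_sync(4 * 1024 * 1024, 2000))
    print(suggest_sync(4 * 1024 * 1024, 100))
```

The rule deliberately ignores the reported power/performance trade-off for DP (∼8-10% lower power, ∼2-5% slower); a caller that cares about power rather than runtime would need to weigh that separately.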
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170102212458/http://synergy.cs.vt.edu/pubs/papers/harb-sync-kernel-mascots16.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/02/71/02713ae2e25319e210b324aa1af60fa3ec6154d2.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>