Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Islam Harb, Wu-Chun Feng
<span title="">2016</span> <i title="IEEE"> <a target="_blank" rel="noopener" href="https://fatcat.wiki/container/z2ot7tezyrgpxlxgssko2bzyw4" style="color: black;">2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)</a> </i> &nbsp;
The lack of support for explicit synchronization between the streaming multiprocessors (SMs) of a GPU adversely impacts the GPU's ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization using explicit and implicit CPU-based mechanisms and dynamic parallelism (DP). Although this topic has been addressed in previous research studies, there has been neither a solid quantification of such overhead, nor ... on when to use each of the different approaches. Therefore, we quantify the synchronization overhead relative to the number of kernel launches and the input data sizes. The quantification, in turn, provides insight as to when to use each of the aforementioned synchronization mechanisms in a target application. Our results show that implicit CPU synchronization incurs a significant overhead that hurts application performance for medium to large data sizes combined with a relatively large number of kernel launches (i.e., ∼1100-5000). Hence, explicit CPU synchronization is recommended for these configurations. In addition, among the three approaches, we conclude that DP is the most efficient for small data sizes (i.e., ≤128k bytes), regardless of the number of kernel launches. DP also implicitly performs inter-block (i.e., global) synchronization with no CPU intervention, and therefore significantly reduces the power consumed by the CPU and PCIe for global synchronization. Our findings show that DP reduces power consumption by ∼8-10%. However, DP-based synchronization involves a trade-off: the power saving is accompanied by a ∼2-5% performance loss.
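The guidance stated in the abstract can be summarized as a small selection heuristic. The following is only a sketch of that summary, not the authors' method: the function name `suggest_sync`, the returned labels, the reading of "128k bytes" as 128 × 1024 bytes, and the treatment of anything above that size as "medium to large" are all assumptions made for illustration. Configurations the abstract does not cover return `None` rather than a guess.

```python
from typing import Optional

# "<=128k bytes" in the abstract; interpreting "k" as 1024 is an assumption.
SMALL_DATA_BYTES = 128 * 1024

# Launch-count band quoted in the abstract (~1100-5000); the bounds are approximate.
MANY_LAUNCHES = range(1100, 5001)


def suggest_sync(data_bytes: int, kernel_launches: int) -> Optional[str]:
    """Return a suggested synchronization mechanism (hypothetical labels),
    or None where the abstract gives no explicit guidance."""
    # Small inputs: DP is reported as most efficient regardless of launch count.
    if data_bytes <= SMALL_DATA_BYTES:
        return "dynamic_parallelism"
    # Larger inputs with many launches: implicit CPU sync is reported as costly,
    # so explicit CPU sync is the recommended choice.
    if kernel_launches in MANY_LAUNCHES:
        return "explicit_cpu"
    # Remaining configurations are not addressed in the abstract.
    return None


if __name__ == "__main__":
    print(suggest_sync(64 * 1024, 10))
    print(suggest_sync(1 << 20, 2000))
    print(suggest_sync(1 << 20, 100))
```

The heuristic considers performance only. Where power matters, the abstract reports that DP lowers power consumption by ∼8-10% at a ∼2-5% performance cost, which may change the choice.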
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1109/mascots.2016.58">doi:10.1109/mascots.2016.58</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/mascots/HarbF16.html">dblp:conf/mascots/HarbF16</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/s5q4kd2arvcpziwovbr67k4rvi">fatcat:s5q4kd2arvcpziwovbr67k4rvi</a> </span>