Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks
2022 · ACM Transactions on Reconfigurable Technology and Systems (ACM)
This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capabilities and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing design latency, through two innovations. First, a custom coarse-grained multi-threaded (CGMT) RNN/LSTM hardware architecture that switches tasks among threads when RNN computational engines meet data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling spatial co-execution of multiple RNN/LSTM inferences. These innovations improve the exploitation of the available parallelism, increasing run-time hardware utilization and boosting design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high-performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.
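The first innovation, coarse-grained multi-threading that switches threads on recurrent data hazards, can be illustrated with a small simulation. The Python sketch below is not the paper's design: all names (Thread, run_cgmt, PIPELINE_LATENCY) and the 4-cycle latency are illustrative assumptions. It shows why a second thread fills the pipeline bubbles that a single RNN inference leaves while waiting for its recurrent input h_{t-1}.

# A minimal sketch (not Remarn's implementation) of coarse-grained
# multi-threading (CGMT): one shared compute engine switches to another
# thread whenever the active thread stalls on a data hazard, i.e. the
# recurrent dependency h_{t-1} of its next LSTM step is still in flight.

from collections import deque
from dataclasses import dataclass

PIPELINE_LATENCY = 4  # assumed cycles before a step's result (h_t) is usable

@dataclass
class Thread:
    name: str
    steps_left: int   # remaining LSTM timesteps in this inference
    ready_at: int = 0 # cycle at which h_{t-1} becomes available

    def hazard(self, now: int) -> bool:
        # Data hazard: the recurrent input of the next step is not ready yet.
        return now < self.ready_at

def run_cgmt(threads, max_cycles=100):
    """Issue one LSTM step per cycle; on a hazard, switch to another thread."""
    rq = deque(threads)
    busy = idle = 0
    for now in range(max_cycles):
        if all(t.steps_left == 0 for t in threads):
            break
        # Rotate to the first thread that has work and no pending hazard.
        for _ in range(len(rq)):
            t = rq[0]
            if t.steps_left > 0 and not t.hazard(now):
                break
            rq.rotate(-1)
        t = rq[0]
        if t.steps_left > 0 and not t.hazard(now):
            t.steps_left -= 1
            t.ready_at = now + PIPELINE_LATENCY  # h_t ready once the pipeline drains
            busy += 1
            rq.rotate(-1)  # coarse-grained switch: hand the engine to the next thread
        else:
            idle += 1      # every thread is stalled; the engine bubbles
    return busy, idle

# With one thread, every step waits out the full recurrent latency;
# with two threads, one thread's steps fill the other's hazard bubbles.
for n in (1, 2):
    ths = [Thread(f"T{i}", steps_left=8) for i in range(n)]
    busy, idle = run_cgmt(ths)
    print(f"{n} thread(s): busy={busy} idle={idle} "
          f"utilization={busy / (busy + idle):.0%}")

Under these toy assumptions, one thread keeps the engine about 28% busy while two threads reach about 53%, mirroring the way CGMT raises run-time hardware utilization without duplicating the compute engine.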
doi:10.1145/3534969 (https://doi.org/10.1145/3534969)
fatcat:g2lj3qnv3bdsxb5mkmwlkoezpu (https://fatcat.wiki/release/g2lj3qnv3bdsxb5mkmwlkoezpu)
Fulltext PDF (Web Archive): https://web.archive.org/web/20220518132420/https://dl.acm.org/doi/pdf/10.1145/3534969