Towards Efficient Convolutional Neural Network for Domain-Specific Applications on FPGA
2018
2018 28th International Conference on Field Programmable Logic and Applications (FPL)
FPGAs have become a popular technology for implementing Convolutional Neural Networks (CNNs) in recent years. ...
... standard convolution layers with efficient convolution blocks, and applying layer fusion to enhance hardware design performance. ...
The Winograd transformation is also used to accelerate spatial convolution. ...
doi:10.1109/fpl.2018.00033
dblp:conf/fpl/ZhaoNLN18
fatcat:juidwpy2jrgfldzkn3j4pc4mpi
Towards Efficient Convolutional Neural Network for Domain-Specific Applications on FPGA
[article]
2018
arXiv
pre-print
FPGAs have become a popular technology for implementing Convolutional Neural Networks (CNNs) in recent years. ...
... standard convolution layers with efficient convolution blocks, and applying layer fusion to enhance hardware design performance. ...
The Winograd transformation is also used to accelerate spatial convolution. ...
arXiv:1809.03318v1
fatcat:2i2mtinbizhvnakoeiecy6gely
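Both records above use the Winograd transformation to speed up spatial convolution. As a minimal illustration of what that buys (my own NumPy sketch of the standard F(2,3) algorithm of Lavin and Gray, not code from the paper), the transform produces two outputs of a 3-tap filter with 4 multiplications instead of 6:

import numpy as np

# Standard F(2,3) transform matrices (Lavin & Gray).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])  # 4-sample input tile
g = np.array([0.5, -1.0, 2.0])      # 3-tap filter

# 4 elementwise multiplications replace the 6 of the direct method.
y = AT @ ((G @ g) * (BT @ d))
assert np.allclose(y, [d[0:3] @ g, d[1:4] @ g])

On hardware, G @ g is precomputed once per filter and the BT/AT stages reduce to additions, which is why FPGA designs like the one above favor this formulation.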
Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels
2020
IEEE transactions on computers
In this article, we propose a new kernel fusion technique for fast convolution algorithms based on MegaKernel. ...
Modern Convolutional Neural Networks (CNNs) require a massive amount of convolution operations. ...
In this paper, we propose a new kernel fusion technique for Winograd convolution algorithms on GPUs. ...
doi:10.1109/tc.2020.2973144
fatcat:quv5yqwzxrcnjorf6alqxx73eu
A Real Time Super Resolution Accelerator with Tilted Layer Fusion
[article]
2022
arXiv
pre-print
To solve the above issues, this paper proposes a real-time hardware accelerator with a tilted layer fusion method that reduces external DRAM bandwidth by 92% and needs only 102 KB of on-chip memory. ...
The design, implemented in a 40 nm CMOS process, achieves 1920x1080@60fps throughput with a 544.3K gate count when running at 600 MHz; it has higher throughput and lower area cost than previous designs. ...
Reference [12] adopts constant-kernel-size Winograd convolution for a regular hardware design. ...
arXiv:2205.03997v1
fatcat:2zkp5q44mrewhjpecej7vvjrhu
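The tilted layer fusion scheme itself is specific to this paper, but the baseline idea of layer fusion it builds on, computing each output tile through consecutive layers so the intermediate feature map never travels to external DRAM, can be sketched in a few lines (a 1-D toy with hypothetical kernels and tile size, not the paper's design):

import numpy as np

def conv3(x, k):
    # 'valid' 3-tap correlation: out[i] = x[i:i+3] . k
    return np.array([x[i:i+3] @ k for i in range(len(x) - 2)])

x  = np.arange(16, dtype=float)   # input line (hypothetical)
k1 = np.array([1.0, 2.0, 1.0])    # layer-1 kernel
k2 = np.array([-1.0, 0.0, 1.0])   # layer-2 kernel

reference = conv3(conv3(x, k1), k2)  # unfused: full intermediate exists

# Fused, tile by tile: a 4-wide layer-2 output tile needs a 6-wide
# intermediate tile and hence an 8-wide input window (a halo of 2 per
# layer); the intermediate is never materialized in full.
TILE = 4
tiles = []
for t0 in range(0, len(reference), TILE):
    window = x[t0 : min(t0 + TILE, len(reference)) + 4]
    tiles.append(conv3(conv3(window, k1), k2))
assert np.allclose(np.concatenate(tiles), reference)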
DWM: A Decomposable Winograd Method for Convolution Acceleration
[article]
2020
arXiv
pre-print
In this paper, we propose a novel Decomposable Winograd Method (DWM), which breaks through the limitation of the original Winograd minimal filtering algorithm and extends it to wide and general convolutions. ...
Compared with the original Winograd algorithm, the proposed DWM is able to support all kinds of convolutions with a speedup of ~2x, without affecting numerical accuracy. ...
By applying the Winograd algorithm to them, we successfully reduce the multiplications of 1-D convolutions with stride > 1. For 2-D convolutions, we nest the 1-D convolution methods. ...
arXiv:2002.00552v1
fatcat:yyfnolqjr5glnfbae6sfiiavsy
DWM: A Decomposable Winograd Method for Convolution Acceleration
2020
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
In this paper, we propose a novel Decomposable Winograd Method (DWM), which breaks through the limitation of the original Winograd minimal filtering algorithm and extends it to wide and general convolutions. ...
Compared with the original Winograd algorithm, the proposed DWM is able to support all kinds of convolutions with a speedup of ~2x, without affecting numerical accuracy. ...
By applying the Winograd algorithm to them, we successfully reduce the multiplications of 1-D convolutions with stride > 1. For 2-D convolutions, we nest the 1-D convolution methods. ...
doi:10.1609/aaai.v34i04.5838
fatcat:hcdbctgfxbgzjalee24y6ilhhq
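The key step in DWM, as both records describe it, is decomposing a strided convolution into stride-1 pieces that the Winograd algorithm can then handle. A minimal NumPy paraphrase of that idea for a 3-tap, stride-2, 1-D convolution (my illustration, not the authors' code):

import numpy as np

x = np.arange(10, dtype=float)   # input signal
k = np.array([3.0, -1.0, 2.0])   # 3-tap kernel, applied with stride 2

# Direct strided correlation: y[i] = x[2i]*k0 + x[2i+1]*k1 + x[2i+2]*k2
n_out = (len(x) - len(k)) // 2 + 1
direct = np.array([x[2*i : 2*i + 3] @ k for i in range(n_out)])

# DWM-style decomposition: split input and kernel by index phase
# (mod the stride); each piece is a stride-1 convolution, so a
# Winograd kernel could be applied to it.
xe, xo = x[0::2], x[1::2]        # even / odd input phases
ke, ko = k[0::2], k[1::2]        # ke = [k0, k2], ko = [k1]
even_part = np.array([xe[i:i+2] @ ke for i in range(len(xe) - 1)])
odd_part  = np.array([xo[i:i+1] @ ko for i in range(len(xo))])
assert np.allclose(even_part + odd_part[:len(even_part)], direct)

Nesting the same trick over rows and columns gives the 2-D case the abstract mentions.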
Hardware Compilation of Deep Neural Networks: An Overview
2018
2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. ...
Finally, we propose some future directions for related research. ...
Lu et al. took the input tile size for Winograd as a configurable parameter [71], while Aydonat et al. used a fixed Winograd configuration and explored parallelism in other dimensions [70]. ...
doi:10.1109/asap.2018.8445088
dblp:conf/asap/ZhaoLNWDNWSCCL18
fatcat:v5txrrsfifa6bah2oksjdlrsgi
Accelerating Deep Neural Networks implementation: A survey
2021
IET Computers & Digital Techniques
Finally, a survey of research works aiming to accelerate the implementation of DNN models on FPGAs is provided. ...
Field Programmable Gate Arrays (FPGAs) are promising platforms for the deployment of large-scale DNNs which seek to reach a balance between the above objectives. ...
Based on loop unrolling and tiling, Rahman et al. [96] presented ICAN, a 3D compute tile for convolutional layers. ...
doi:10.1049/cdt2.12016
fatcat:3kl4j5ztl5eahmgv7vetu2egay
Winograd Convolution for Deep Neural Networks: Efficient Point Selection
[article]
2022
arXiv
pre-print
A defining feature of each Winograd convolution algorithm is a set of real-valued points where polynomials are sampled. ...
We study a range of sizes for small convolutions and achieve reductions in error ranging from 2% to around 59%; selecting a subset of our proposed points will always lead to a lower error. ...
... Israr Ali Khan of Namal Institute Mianwali, Pakistan, for his support. ...
arXiv:2201.10369v1
fatcat:gpwr6gchdfg55hta5ejmmem33a
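For context on what "point selection" means here: the Toom-Cook/Winograd construction of F(m, r) evaluates the input and filter polynomials at m + r - 1 chosen points and interpolates the product, so the transform matrices derive from the corresponding Vandermonde matrices. The conventional choice for F(2,3), the points {0, 1, -1} plus the point at infinity, yields the familiar matrices used in the numeric sketch earlier in this listing:

Y = A^T\!\left[(G g) \odot (B^T d)\right],\quad
B^T = \begin{pmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 1 & 0\\ 0 & -1 & 1 & 0\\ 0 & 1 & 0 & -1 \end{pmatrix},\;
G = \begin{pmatrix} 1 & 0 & 0\\ \tfrac12 & \tfrac12 & \tfrac12\\ \tfrac12 & -\tfrac12 & \tfrac12\\ 0 & 0 & 1 \end{pmatrix},\;
A^T = \begin{pmatrix} 1 & 1 & 1 & 0\\ 0 & 1 & -1 & -1 \end{pmatrix}.

Other point sets change the matrix entries and hence the floating-point error that this paper sets out to minimize.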
MNN: A Universal and Efficient Inference Engine
[article]
2020
arXiv
pre-print
To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. ...
In this paper, the contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to ...
Acknowledgements: We thank Chaoyue Niu for helpful discussions and the anonymous reviewers for their valuable comments to improve our work. ...
arXiv:2002.12418v1
fatcat:ppeykiv57nc6bfqa74lyzse3by
A Survey on System-Level Design of Neural Network Accelerators
2021
Journal of Integrated Circuits and Systems
For the nested loop of convolutional (CONV) layers, we discuss the effects of loop optimizations such as loop interchange, tiling, unrolling and fusion on CNN accelerators. ...
In this paper, we present a brief survey on the system-level optimizations used for convolutional neural network (CNN) inference accelerators. ...
Loop fusion [29] fuses a set of loops into a single fully nested loop, as shown in Fig. 16. ...
doi:10.29292/jics.v16i2.505
fatcat:ibbkeob42jepbguezlptws2qha
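As a concrete instance of the loop fusion this survey describes, the two loops below (a 3-tap convolution followed by a ReLU pass over the intermediate) merge into one fully nested loop in which each value is activated as soon as it is produced, eliminating the intermediate array; all shapes are hypothetical:

import numpy as np

x, k = np.random.rand(64), np.random.rand(3)
n = len(x) - 2

# Unfused: intermediate array written by loop 1, re-read by loop 2.
acc = np.empty(n)
for i in range(n):
    acc[i] = x[i]*k[0] + x[i+1]*k[1] + x[i+2]*k[2]
out_unfused = np.maximum(acc, 0.0)

# Fused: one loop; the intermediate lives only in a local variable.
out_fused = np.empty(n)
for i in range(n):
    v = x[i]*k[0] + x[i+1]*k[1] + x[i+2]*k[2]
    out_fused[i] = v if v > 0.0 else 0.0
assert np.allclose(out_fused, out_unfused)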
AMAIX In-Depth: A Generic Analytical Model for Deep Learning Accelerators
2022
20th IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)
A commonly used method for finding these solutions as early as possible in the design cycle is the employment of analytical models, which try to describe a design by simple yet insightful and sufficiently ...
In recent years the growing popularity of Convolutional Neural Networks (CNNs) has driven the development of specialized hardware, so-called Deep Learning Accelerators (DLAs). ...
For this reason, the first convolution must be split into 5 tiles (see Table 2), as the ifmap does not fit into the 512 KiB convolution buffer as a whole. ...
doi:10.18154/rwth-2022-02911
fatcat:vpwnyymaxrfwvirzfs2r3drtgm
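The tile split quoted above is exactly the kind of quantity an analytical model predicts with one line of arithmetic. A hedged sketch, where the 512 KiB buffer size is from the record but the layer shape and data type are made up:

import math

CBUF_BYTES = 512 * 1024           # convolution buffer (from the record)
ifmap_bytes = 224 * 224 * 32 * 2  # hypothetical ifmap: H x W x C, int16
tiles = math.ceil(ifmap_bytes / CBUF_BYTES)
print(tiles)  # 7 for these made-up dimensions; the record's layer gave 5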
A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks
2019
2019 29th International Conference on Field Programmable Logic and Applications (FPL)
In this paper, we develop an FPGA acceleration platform that leverages a unified framework architecture for general-purpose convolutional neural network (CNN) inference acceleration at a data center. ...
For various non-convolution operators, a filter processing unit is designed for general-purpose filter-like/pointwise operators. ...
On the basis of the tile partition method in Section III-C, the width of the input feature map tile can be flexibly narrowed to buffer more rows when a larger kernel size is used. ...
doi:10.1109/fpl.2019.00032
dblp:conf/fpl/YuGWMWZMZMC19
fatcat:l6jrzquumjfwnj7r46bjbymwk4
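The tile-narrowing trade-off described above is simple buffer arithmetic: with a fixed on-chip buffer, halving the tile width doubles the rows that fit, which is what a larger kernel needs. A sketch with hypothetical sizes (only the trade-off itself is from the record):

BUF_BYTES = 256 * 1024            # hypothetical on-chip row buffer
CHANNELS, BYTES_PER_ELEM = 64, 1  # hypothetical layer format

def rows_buffered(tile_width):
    # Rows of the input tile that fit in the fixed buffer.
    return BUF_BYTES // (tile_width * CHANNELS * BYTES_PER_ELEM)

print(rows_buffered(224))  # wide tile     -> 18 rows
print(rows_buffered(112))  # narrowed tile -> 36 rows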
SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference
[article]
2020
arXiv
pre-print
For sparse 3x3 convolutions, we show speedups of over 5x on use cases in ResNet-50. ...
In this paper, we present SparseRT, a code generator that leverages unstructured sparsity to accelerate sparse linear algebra operations in deep learning inference on GPUs. ...
For dense convolutions, it is typically materialized on the fly, tile by tile, as the computation proceeds [7]. ...
arXiv:2008.11849v1
fatcat:x4usrp5ocrhifkuicim3nujtlm
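SparseRT's actual output is generated GPU code, but the underlying move, baking an unstructured sparsity pattern into straight-line code so that zero weights cost nothing, can be shown with a toy Python code generator (my sketch, not the paper's system):

import numpy as np

w = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.5, 0.0])  # sparse weights

# Emit a kernel specialized to the nonzero pattern of w.
terms = [f"{float(w[j])!r} * x[{j}]" for j in range(len(w)) if w[j] != 0.0]
src = "def sparse_dot(x):\n    return " + " + ".join(terms)
ns = {}
exec(src, ns)   # "compile" the specialized kernel

x = np.arange(8, dtype=float)
assert np.isclose(ns["sparse_dot"](x), w @ x)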
FPGA Implementation for Odor Identification with Depthwise Separable Convolutional Neural Network
2021
Sensors
In this article, we propose a method for implementing a deep neural network for odor identification in a small-scale Field-Programmable Gate Array (FPGA). ...
First, a lightweight odor identification with depthwise separable convolutional neural network (OI-DSCNN) is proposed to reduce parameters and accelerate hardware implementation performance. ...
The implementation of depthwise separable convolution and the Winograd algorithm could reduce the number of convolution parameters and accelerate the odor identification rate. ...
doi:10.3390/s21030832
pmid:33513692
fatcat:73i3v2fgabhovf5vn2pkhulkoe
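The savings the abstract attributes to depthwise separable convolution follow from a standard operation count: a depthwise pass (one KxK filter per channel) plus a 1x1 pointwise pass replaces the full KxKxC_in xC_out filter bank. With a hypothetical layer shape:

K, C_IN, C_OUT = 3, 64, 128
standard  = K * K * C_IN * C_OUT         # 73,728 mults per output pixel
separable = K * K * C_IN + C_IN * C_OUT  # 576 + 8,192 = 8,768
print(standard / separable)              # ~8.4x fewer multiplications

The Winograd step mentioned in the abstract then further reduces the multiplications inside the K x K depthwise part.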
Showing results 1–15 of 51.