
Towards Efficient Convolutional Neural Network for Domain-Specific Applications on FPGA

Ruizhe Zhao, Ho-Cheung Ng, Wayne Luk, Xinyu Niu
2018 28th International Conference on Field Programmable Logic and Applications (FPL)
FPGA has become a popular technology for implementing Convolutional Neural Networks (CNNs) in recent years.  ...  standard convolution layers with efficient convolution blocks, and applying layer fusion to enhance hardware design performance.  ...  Winograd transformation is also used to accelerate spatial convolution.  ...
doi:10.1109/fpl.2018.00033 dblp:conf/fpl/ZhaoNLN18 fatcat:juidwpy2jrgfldzkn3j4pc4mpi
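The Winograd transformation these entries refer to trades multiplications for cheap additions. As a minimal illustrative sketch (not the paper's FPGA design), the F(2,3) variant computes two outputs of a 3-tap 1-D convolution with 4 elementwise multiplies instead of the 6 a direct computation needs, using the standard transform matrices:

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices (Lavin & Gray style).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 outputs."""
    return AT @ ((G @ g) * (BT @ d))            # 4 multiplies total

d = np.array([1., 2., 3., 4.])
g = np.array([1., 1., 1.])
direct = np.array([d[0:3] @ g, d[1:4] @ g])     # sliding-window reference
assert np.allclose(winograd_f23(d, g), direct)
```

Larger tiles (F(4,3), F(6,3)) save more multiplies at the cost of numerically worse transform matrices, which is why hardware designs often fix the tile size.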

Towards Efficient Convolutional Neural Network for Domain-Specific Applications on FPGA [article]

Ruizhe Zhao, Ho-Cheung Ng, Wayne Luk, Xinyu Niu
2018 arXiv   pre-print
FPGA has become a popular technology for implementing Convolutional Neural Networks (CNNs) in recent years.  ...  standard convolution layers with efficient convolution blocks, and applying layer fusion to enhance hardware design performance.  ...  Winograd transformation is also used to accelerate spatial convolution.  ...
arXiv:1809.03318v1 fatcat:2i2mtinbizhvnakoeiecy6gely

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels

Liancheng Jia, Yun Liang, Xiuhong Li, Liqiang Lu, Shengen Yan
2020 IEEE transactions on computers  
In this article, we propose a new kernel fusion technique for fast convolution algorithms based on MegaKernel.  ...  Modern Convolutional Neural Networks (CNNs) require a massive number of convolution operations.  ...  In this paper, we propose a new kernel fusion technique for Winograd convolution algorithms on GPUs.  ...
doi:10.1109/tc.2020.2973144 fatcat:quv5yqwzxrcnjorf6alqxx73eu

A Real Time Super Resolution Accelerator with Tilted Layer Fusion [article]

An-Jung Huang, Kai-Chieh Hsu, Tian-Sheuan Chang
2022 arXiv   pre-print
To solve the above issues, this paper proposes a real-time hardware accelerator with the tilted layer fusion method that reduces the external DRAM bandwidth by 92% and needs just 102 KB of on-chip memory.  ...  The design implemented with a 40nm CMOS process achieves 1920x1080@60fps throughput with 544.3K gate count when running at 600MHz; it has higher throughput and lower area cost than previous designs.  ...  Reference [12] adopts the constant kernel size Winograd convolution for regular hardware design.  ...
arXiv:2205.03997v1 fatcat:2zkp5q44mrewhjpecej7vvjrhu

DWM: A Decomposable Winograd Method for Convolution Acceleration [article]

Di Huang, Xishan Zhang, Rui Zhang, Tian Zhi, Deyuan He, Jiaming Guo, Chang Liu, Qi Guo, Zidong Du, Shaoli Liu, Tianshi Chen, Yunji Chen
2020 arXiv   pre-print
In this paper, we propose a novel Decomposable Winograd Method (DWM), which breaks through the limitation of the original Winograd minimal filtering algorithm and extends it to wide and general convolutions.  ...  Compared against the original Winograd, the proposed DWM is able to support all kinds of convolutions with a speedup of ~2x, without affecting the numerical accuracy.  ...  By applying the Winograd algorithm to them, we have successfully reduced the multiplications of 1-D convolutions with stride > 1. For 2-D convolutions, we nest the 1-D convolution methods.  ...
arXiv:2002.00552v1 fatcat:yyfnolqjr5glnfbae6sfiiavsy

DWM: A Decomposable Winograd Method for Convolution Acceleration

Di Huang, Xishan Zhang, Rui Zhang, Tian Zhi, Deyuan He, Jiaming Guo, Chang Liu, Qi Guo, Zidong Du, Shaoli Liu, Tianshi Chen, Yunji Chen
2020 Proceedings of the AAAI Conference on Artificial Intelligence, 34(4)
In this paper, we propose a novel Decomposable Winograd Method (DWM), which breaks through the limitation of the original Winograd minimal filtering algorithm and extends it to wide and general convolutions.  ...  Compared against the original Winograd, the proposed DWM is able to support all kinds of convolutions with a speedup of ~2x, without affecting the numerical accuracy.  ...  By applying the Winograd algorithm to them, we have successfully reduced the multiplications of 1-D convolutions with stride > 1. For 2-D convolutions, we nest the 1-D convolution methods.  ...
doi:10.1609/aaai.v34i04.5838 fatcat:hcdbctgfxbgzjalee24y6ilhhq
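The decomposition DWM describes can be illustrated in 1-D: a kernel too large for the basic Winograd algorithm is split into small, Winograd-friendly pieces whose partial outputs sum to the exact result. A hedged numpy sketch (illustrative only, not the authors' implementation; a 5-tap kernel split into a 3-tap head and a 2-tap tail):

```python
import numpy as np

def corr1d(x, w):
    """Plain 'valid' 1-D correlation (what CNN layers actually compute)."""
    n = len(x) - len(w) + 1
    return np.array([x[i:i + len(w)] @ w for i in range(n)])

def decomposed_corr1d(x, w):
    """Split a 5-tap kernel into taps 0..2 and taps 3..4.

    Each small correlation could be handled by a standard Winograd
    kernel; their outputs sum to the exact 5-tap result.
    """
    n = len(x) - len(w) + 1
    head = corr1d(x, w[:3])[:n]        # taps 0..2 over the same windows
    tail = corr1d(x[3:], w[3:])        # taps 3..4, input shifted by 3
    return head + tail

x = np.arange(10, dtype=float)
w = np.array([1., -2., 3., 0.5, 1.])
assert np.allclose(decomposed_corr1d(x, w), corr1d(x, w))
```

Because the split is exact (no approximation of the kernel), the numerical accuracy claim in the abstract follows directly: only the small Winograd transforms introduce the usual floating-point rounding.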

Hardware Compilation of Deep Neural Networks: An Overview

Ruizhe Zhao, Shuanglong Liu, Ho-Cheung Ng, Erwei Wang, James J. Davis, Xinyu Niu, Xiwei Wang, Huifeng Shi, George A. Constantinides, Peter Y. K. Cheung, Wayne Luk
2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies.  ...  Finally, we propose some future directions for related research.  ...  Lu et al. took input tile size for Winograd as a configurable parameter [71] , while Aydonat et al. used a fixed Winograd configuration and explored parallelism in other dimensions [70] .  ... 
doi:10.1109/asap.2018.8445088 dblp:conf/asap/ZhaoLNWDNWSCCL18 fatcat:v5txrrsfifa6bah2oksjdlrsgi

Accelerating Deep Neural Networks implementation: A survey

Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud
2021 IET Computers & Digital Techniques  
Finally, a survey of research works aiming to accelerate the implementation of DNN models on FPGAs is provided.  ...  Field Programmable Gate Arrays (FPGAs) are promising platforms for the deployment of large-scale DNN which seek to reach a balance between the above objectives.  ...  Based on unrolling and tiling loops, Rahman et al. [96] presented ICAN, a 3D compute tile for convolutional layers.  ... 
doi:10.1049/cdt2.12016 fatcat:3kl4j5ztl5eahmgv7vetu2egay

Winograd Convolution for Deep Neural Networks: Efficient Point Selection [article]

Syed Asad Alam, Andrew Anderson, Barbara Barabasz, David Gregg
2022 arXiv   pre-print
A defining feature of each Winograd convolution algorithm is a set of real-valued points where polynomials are sampled.  ...  We study a range of sizes for small convolutions and achieve reduction in error ranging from 2% to around 59% when we select a subset of our proposed points, which will always lead to a lower error.  ...  Israr Ali Khan of Namal Institute Mianwali, Pakistan for his support.  ...
arXiv:2201.10369v1 fatcat:gpwr6gchdfg55hta5ejmmem33a

MNN: A Universal and Efficient Inference Engine [article]

Xiaotang Jiang, Huan Wang, Yiliu Chen, Ziqi Wu, Lichuan Wang, Bin Zou, Yafeng Yang, Zongyang Cui, Yu Cai, Tianhang Yu, Chengfei Lv, Zhihua Wu
2020 arXiv   pre-print
To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications.  ...  In this paper, the contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to  ...  ACKNOWLEDGEMENTS We thank Chaoyue Niu for helpful discussions and the anonymous reviewers for their valuable comments to improve our work.  ...
arXiv:2002.12418v1 fatcat:ppeykiv57nc6bfqa74lyzse3by

A Survey on System-Level Design of Neural Network Accelerators

Kenshu Seto
2021 Journal of Integrated Circuits and Systems  
For the nested loop of convolutional (CONV) layers, we discuss the effects of loop optimizations such as loop interchange, tiling, unrolling and fusion on CNN accelerators.  ...  In this paper, we present a brief survey of the system-level optimizations used for convolutional neural network (CNN) inference accelerators.  ...  Loop fusion [29] fuses a set of loops into a single fully nested loop as shown in Fig. 16.  ...
doi:10.29292/jics.v16i2.505 fatcat:ibbkeob42jepbguezlptws2qha
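The loop fusion the survey discusses merges a producer loop and a consumer loop so the intermediate array never round-trips through memory. A toy sketch (scalar Python loops standing in for hardware loop nests; both versions compute the same result):

```python
import numpy as np

def unfused(x):
    """Two separate loops: the intermediate t[] is fully materialized."""
    t = np.empty_like(x)
    for i in range(len(x)):          # loop 1: produce intermediate
        t[i] = x[i] * 2.0
    y = np.empty_like(x)
    for i in range(len(x)):          # loop 2: consume intermediate
        y[i] = t[i] + 1.0
    return y

def fused(x):
    """One fused loop: each intermediate value stays in a temporary."""
    y = np.empty_like(x)
    for i in range(len(x)):
        y[i] = x[i] * 2.0 + 1.0      # no t[] buffer, no extra traffic
    return y

x = np.array([1., 2., 3.])
assert np.allclose(unfused(x), fused(x))
```

On an accelerator the saved buffer is on-chip SRAM or DRAM bandwidth rather than a Python array, which is why layer fusion recurs throughout the results above.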

AMAIX In-Depth: A Generic Analytical Model for Deep Learning Accelerators

Niko Zurstraßen, Lukas Jünger, Tim Kogel, Holger Keding, Rainer Leupers
2022 20th IEEE International Conference on Embedded Computer Systems: Architectures  
A commonly used method for finding these solutions as early as possible in the design cycle is the employment of analytical models, which try to describe a design by simple yet insightful and sufficiently  ...  In recent years, the growing popularity of Convolutional Neural Networks (CNNs) has driven the development of specialized hardware, so-called Deep Learning Accelerators (DLAs).  ...  For this reason, the first convolution must be split into 5 tiles (see Table 2), as the ifmap does not fit into the 512 KiB convolution buffer as a whole.  ...
doi:10.18154/rwth-2022-02911 fatcat:vpwnyymaxrfwvirzfs2r3drtgm

A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Xiaoyu Yu, Jianlin Gao, Yuwei Wang, Jie Miao, Ephrem Wu, Heng Zhang, Yu Meng, Bo Zhang, Biao Min, Dewei Chen
2019 29th International Conference on Field Programmable Logic and Applications (FPL)
In this paper, we develop an FPGA acceleration platform that leverages a unified framework architecture for general-purpose convolutional neural network (CNN) inference acceleration at a data center.  ...  For various non-convolution operators, a filter processing unit is designed for general-purpose filter-like/pointwise operators.  ...  On the basis of the tile partition method in Section III C, the width of the tile of the input feature map can be flexibly narrowed to buffer more rows when a larger kernel size is used. VI.  ... 
doi:10.1109/fpl.2019.00032 dblp:conf/fpl/YuGWMWZMZMC19 fatcat:l6jrzquumjfwnj7r46bjbymwk4

SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference [article]

Ziheng Wang
2020 arXiv   pre-print
For sparse 3x3 convolutions, we show speedups of over 5x on use cases in ResNet-50.  ...  In this paper, we present SparseRT, a code generator that leverages unstructured sparsity to accelerate sparse linear algebra operations in deep learning inference on GPUs.  ...  For dense convolutions, it is typically materialized on the fly tile-by-tile as the computation proceeds [7].  ...
arXiv:2008.11849v1 fatcat:x4usrp5ocrhifkuicim3nujtlm
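The matrix "materialized on the fly tile-by-tile" in the last fragment presumably refers to the im2col lowering, which turns a convolution into one matrix multiply. A small dense sketch (single channel, unit stride; function names are my own, and real GPU kernels build the columns per tile rather than all at once as done here):

```python
import numpy as np

def im2col(x, k):
    """Gather every k x k window of x into one column of a matrix."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_via_im2col(x, kern):
    """Convolution (correlation) as: flattened-kernel @ im2col-matrix."""
    k = kern.shape[0]
    oh, ow = x.shape[0] - k + 1, x.shape[1] - k + 1
    return (kern.ravel() @ im2col(x, k)).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
kern = np.ones((3, 3))
# Reference: direct sliding-window sums (all-ones kernel).
ref = np.array([[x[i:i+3, j:j+3].sum() for j in range(2)] for i in range(2)])
assert np.allclose(conv2d_via_im2col(x, kern), ref)
```

With a sparse kernel matrix, most columns of the im2col tile are multiplied by zeros, which is the redundancy code generators like SparseRT specialize away.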

FPGA Implementation for Odor Identification with Depthwise Separable Convolutional Neural Network

Zhuofeng Mo, Dehan Luo, Tengteng Wen, Yu Cheng, Xin Li
2021 Sensors  
In this article, we propose a method for implementing a deep neural network for odor identification in a small-scale Field-Programmable Gate Array (FPGA).  ...  First, a lightweight odor identification depthwise separable convolutional neural network (OI-DSCNN) is proposed to reduce parameters and accelerate hardware implementation performance.  ...  The implementation of depthwise separable convolution and the Winograd algorithm could reduce the number of convolution parameters and accelerate the odor identification rate.  ...
doi:10.3390/s21030832 pmid:33513692 fatcat:73i3v2fgabhovf5vn2pkhulkoe
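The parameter savings of depthwise separable convolution come from factoring one k x k x Cin x Cout convolution into a per-channel k x k depthwise stage plus a 1x1 pointwise stage. A quick arithmetic sketch with hypothetical layer sizes (not the OI-DSCNN configuration):

```python
def standard_params(k, c_in, c_out):
    """One k x k filter per (input channel, output channel) pair."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    depthwise = k * k * c_in         # one k x k filter per input channel
    pointwise = c_in * c_out         # 1x1 conv mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 32 -> 64 channels.
k, c_in, c_out = 3, 32, 64
assert standard_params(k, c_in, c_out) == 18432
assert separable_params(k, c_in, c_out) == 2336   # roughly 7.9x fewer
```

The same factoring also shrinks the multiply count per output pixel, which is what makes the combination with Winograd attractive on a small FPGA.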
Showing results 1–15 of 51.