190 Hits in 3.8 sec

Efficient Residue Number System Based Winograd Convolution [article]

Zhi-Gang Liu, Matthew Mattina
2020 arXiv   pre-print
Our work extends the Winograd algorithm to the Residue Number System (RNS).  ...  Prior research has shown that the Winograd algorithm can reduce the computational complexity of convolutional neural networks (CNNs) with weights and activations represented in floating point.  ...  Winograd Convolution over Residue Number System: We extend the Winograd convolution algorithm described in Section 4 to the Residue Number System (RNS) of Section 3 to formulate a new implementation.  ... 
arXiv:2007.12216v1 fatcat:64sbsyjr3rbhlbbx5xf7yxq2rq

Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs [article]

Partha Maji, Andrew Mundy, Ganesh Dasika, Jesse Beu, Matthew Mattina, Robert Mullins
2019 arXiv   pre-print
This paper aims to fill this gap and focuses on the efficient implementation of Winograd- or Cook-Toom-based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today.  ...  The Winograd or Cook-Toom class of algorithms helps to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs).  ...  We introduce a novel region-wise multi-channel scheme using GEMM (General Matrix Multiplication) for energy-efficient implementation of Winograd- or Cook-Toom-based convolution on resource-constrained mobile  ... 
arXiv:1903.01521v1 fatcat:5vdolq5d2retdasckwocglo25i

A Real-Time FPGA Accelerator Based on Winograd Algorithm for Underwater Object Detection

Liangwei Cai, Ceng Wang, Yuan Xu
2021 Electronics  
Furthermore, the FPGA implementations of the various convolutions in the proposed network are optimized based on the Winograd algorithm.  ...  Compared to a CPU, our accelerator achieves 7.5×–8.7× speedup and 52×–60× better energy efficiency.  ...  In the case of 2-D convolution with a stride of 1, the Winograd algorithm can reduce the number of convolution multiplications from m²r² to n².  ... 
doi:10.3390/electronics10232889 fatcat:7p6woc4g6jc6tozls7ivi5h2py
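The m²r² → n² reduction quoted in this entry is easiest to see in one dimension. Below is a minimal, illustrative Python sketch of Winograd F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6; the function names are ours, not the paper's.

```python
# Minimal sketch of 1-D Winograd F(2,3): 2 outputs of a 3-tap filter
# with 4 multiplications instead of 6. Nesting it in two dimensions
# gives the paper's F(2x2,3x3) case, where m^2*r^2 = 36 multiplications
# drop to n^2 = 16, with n = m + r - 1 = 4.

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (can be precomputed once per filter).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # The 4 multiplications on transformed data.
    m1 = (d0 - d2) * G0
    m2 = (d1 + d2) * G1
    m3 = (d2 - d1) * G2
    m4 = (d1 - d3) * G3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: direct 3-tap correlation, 6 multiplications."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 2, 3]))  # [14.0, 20.0]
print(direct([1, 2, 3, 4], [1, 2, 3]))        # [14, 20]
```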

A generalized method for constructing subquadratic complexity GF(2^k) multipliers

B. Sunar
2004 IEEE transactions on computers  
To obtain the short convolution algorithms, the Winograd short convolution algorithm is reintroduced and analyzed in the context of polynomial multiplication.  ...  The construction is obtained by recursively extending short convolution algorithms and nesting them.  ...  ACKNOWLEDGMENTS This material is based upon work supported by the US National Science Foundation under Grant No. ANI-0112889.  ... 
doi:10.1109/tc.2004.52 fatcat:3xzdahqz35bb7pdccamazbgpni
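The recursive extension and nesting of short convolution algorithms that this entry describes is, in its simplest two-way form, Karatsuba multiplication of GF(2) polynomials: three half-size products replace four. A minimal sketch of that two-way case (our own illustration, not Sunar's general construction), with polynomials packed into Python integers:

```python
# GF(2)[x] polynomials packed into ints: bit i holds the coefficient
# of x^i, so addition/subtraction is XOR (carry-less arithmetic).

def clmul_schoolbook(a, b):
    """Quadratic carry-less multiply, used as the recursion base."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def clmul_karatsuba(a, b, bits=64):
    """Two-way split: 3 half-size products instead of 4, giving the
    subquadratic O(n^log2(3)) complexity. Assumes a, b < 2**bits."""
    if bits <= 8:
        return clmul_schoolbook(a, b)
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = clmul_karatsuba(a0, b0, h)
    hi = clmul_karatsuba(a1, b1, h)
    mid = clmul_karatsuba(a0 ^ a1, b0 ^ b1, h)
    # In GF(2) subtraction is XOR, so the middle term is mid ^ lo ^ hi.
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))
```

The paper generalizes this pattern by nesting longer Winograd short convolution algorithms, not just the 2-point split shown here.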

General Method for Prime-point Cyclic Convolution over the Real Field [article]

Qi Cai, Tsung-Ching Lin, Yuanxin Wu, Wenxian Yu, Trieu-Kien Truong
2019 arXiv   pre-print
A general and fast method is conceived for computing the cyclic convolution of n points, where n is a prime number.  ...  It is well known that the discrete Fourier transform (DFT) can be expressed in terms of cyclic convolution, so the latter can be utilized to compute the DFT when the block length is a prime.  ... 
arXiv:1905.03398v1 fatcat:aqcjnldkkrgyhit3ygbtarrsdi

On the computation of discrete fourier transform using fermat number transform

Wan-Chi Siu, A.G. Constantinides
1984 IEE Proceedings F Communications Radar and Signal Processing  
The number of multiplications per point is in most cases not more than one, whereas the number of shift-adds is approximately equal to the number of additions in the Winograd Fourier transform algorithm.  ...  In the paper, the results of a study using Fermat number transforms (FNTs) to compute discrete Fourier transforms (DFTs) are presented.  ...  Let us consider the cyclic convolution of the two sequences {x_n : n = 0, 1, ..., N−1} and {h_n : n = 0, 1, ..., N−1}, where k = 0, 1, ..., N−1. Let us denote (g^n)_P as the residue of the number  ... 
doi:10.1049/ip-f-1.1984.0003 fatcat:itfatpy34fchxocw5gbywh5mku
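The cyclic convolution this entry works with can be computed exactly by a number theoretic transform over a Fermat prime. The sketch below is a toy illustration, not the paper's configuration: it picks F3 = 2^8 + 1 = 257 and length 8, and uses a naive O(N²) transform (a real FNT exploits the fact that the root of unity is a power of 2, so all transform multiplies become shifts).

```python
# Length-8 cyclic convolution via a number theoretic transform over the
# Fermat prime F3 = 2^8 + 1 = 257 (illustrative choice). 3 is a
# primitive root mod 257, so w = 3^(256/8) = 64 = 2^6 is a primitive
# 8th root of unity; powers of w are all shifts in hardware.

P = 257          # Fermat prime F3
N = 8
W = pow(3, (P - 1) // N, P)   # primitive N-th root of unity mod P

def ntt(x, w):
    """Naive O(N^2) forward/inverse transform mod P."""
    return [sum(x[n] * pow(w, k * n, P) for n in range(N)) % P
            for k in range(N)]

def cyclic_convolution(x, h):
    """Exact, provided the true convolution values stay below P."""
    X, H = ntt(x, W), ntt(h, W)
    Y = [(a * b) % P for a, b in zip(X, H)]
    inv_n = pow(N, -1, P)       # modular inverses via 3-arg pow
    w_inv = pow(W, -1, P)
    return [(inv_n * v) % P for v in ntt(Y, w_inv)]

print(cyclic_convolution([1, 2, 3, 4, 0, 0, 0, 0],
                         [1, 1, 0, 0, 0, 0, 0, 0]))
# -> [1, 3, 5, 7, 4, 0, 0, 0]
```

Because all arithmetic is modular and exact, there is no rounding error, which is the appeal of FNTs over floating-point FFTs for convolution.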

Very fast discrete Fourier transform using number theoretic transform

Wan-Chi Siu, A.G. Constantinides
1983 IEE Proceedings G (Electronic Circuits and Systems)  
It is shown that number theoretic transforms (NTTs) can be used to compute the discrete Fourier transform (DFT) very efficiently.  ...  By noting some simple properties of number theory and the DFT, the total number of real multiplications for a length-P DFT is reduced to (P − 1).  ...  Winograd [3] showed that the minimum number of multiplications required to compute the circular convolution of two length-N sequences is 2N − K, where K is the number of divisors of N, including 1 and  ... 
doi:10.1049/ip-g-1.1983.0036 fatcat:3cfetpciknamrg3xn2xwa7unqy
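Winograd's 2N − K count quoted in this entry is easy to tabulate. In the rational-field case, K is the number of divisors of N (one irreducible cyclotomic factor of z^N − 1 per divisor); the function name below is ours:

```python
# Winograd's lower bound on multiplications for length-N circular
# convolution over the rationals: 2N - K, where K counts the divisors
# of N (including 1 and N), i.e. the irreducible factors of z^N - 1.

def min_multiplications(n):
    k = sum(1 for d in range(1, n + 1) if n % d == 0)
    return 2 * n - k

for n in (2, 4, 8):
    print(n, min_multiplications(n))   # 2 -> 2, 4 -> 5, 8 -> 12
```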

Analyzing Machine Learning Workloads Using a Detailed GPU Simulator [article]

Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor Aamodt
2019 arXiv   pre-print
[Figure captions: Comparing DRAM Efficiency to IPC; Fig. 21, Backwards Filter Convolution (Winograd Nonfused) Shader IPC plot; Fig. 22, Forward Convolution (Winograd Nonfused) Warp Divergence plot; Fig. 23, DRAM Efficiency and Utilization.]  ...  C. Global and Shader IPC: The Winograd Nonfused algorithm has the highest IPCs for all three types of convolution.  ... 
arXiv:1811.08933v2 fatcat:25pubfhlqndxzkwxniylse45wm

Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors [article]

Lukas Baischer, Matthias Wess, Nima TaheriNejad
2021 arXiv   pre-print
FPGA-based implementations are well suited to showing the effect of DNN optimization methods on accuracy and throughput; for this reason, the focus of this work is on FPGA-based implementations.  ...  Deep neural networks (DNNs) have the advantage that they can take a large number of parameters into account, which enables them to solve complex tasks.  ...  In a digital system, base 2 is optimal, since it allows multiplications to be transformed into shifts [28, 45].  ... 
arXiv:2104.09252v1 fatcat:625wtuskhff3lbswhwmj7decni
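The base-2 remark in this entry can be made concrete: quantizing a weight to the nearest power of two turns each multiply into a shift. A toy sketch (the function names and the truncating fixed-point behavior are our assumptions, not the tutorial's code):

```python
import math

def quantize_pow2(w):
    """Approximate a nonzero weight w by sign * 2^e."""
    sign = -1 if w < 0 else 1
    e = round(math.log2(abs(w)))
    return sign, e

def shift_mac(x, sign, e):
    """Compute sign * x * 2^e on an integer activation using only a
    shift; negative e truncates, matching fixed-point hardware."""
    return sign * (x << e) if e >= 0 else sign * (x >> -e)

sign, e = quantize_pow2(0.24)      # 0.24 ~ 2^-2 = 0.25
print(shift_mac(16, sign, e))      # 16 * 0.25 -> 4, via a right shift
```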

Deep Tensor Convolution on Multicores [article]

David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, Nir Shavit
2017 arXiv   pre-print
Here we extend and optimize the faster Winograd class of convolution algorithms to the N-dimensional case, specifically for CPU hardware.  ...  Deep convolutional neural networks (ConvNets) with 3-dimensional kernels allow joint modeling of spatiotemporal features.  ...  Although recent studies have begun to explore extensions of FFT-based convolution to 3 dimensions (Zlateski et al., 2016), to our knowledge there have been no attempts to extend Lavin and Gray's Winograd-style  ... 
arXiv:1611.06565v3 fatcat:ouzr3bssdnftxe6zz5nmxdow7e

Effective and High Computing Algorithms for Convolution Neural Networks

P Syamala Rao, Dr G.P.SaradhiVarma, Rajasekhar Mutukuri
2018 International Journal of Engineering & Technology  
For large filters, conventional Fast Fourier Transform (FFT)-based convolution is preferably fast, yet state-of-the-art convolutional neural networks use small 3 × 3 filters.  ...  In these situations, the computation speed over the training set determines the success of convolutional neural networks.  ...  Fast Fourier Transform based convolution  ... 
doi:10.14419/ijet.v7i3.31.18203 fatcat:24m3ebi5dbccjg5l7j6pdb6e7q

Applying the Residue Number System to Network Inference [article]

Mohamed Abdelhamid, Skanda Koppula
2017 arXiv   pre-print
In particular, we propose using the Residue Number System (RNS) as the internal number representation across all layer evaluations, allowing us to explore usage of the more power-efficient RNS multipliers  ...  Preliminaries: Residue Number System. At its core, the Residue Number System relies on the Chinese Remainder Theorem (CRT) to represent a large integer as a tuple of smaller integers.  ... 
arXiv:1712.04614v1 fatcat:2sf7giatqncappuodkg27k6kfu
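The CRT-based representation this entry describes can be sketched with toy moduli. The moduli (7, 8, 9) below are our illustrative choice, not the paper's: any pairwise-coprime set works, and multiplication happens independently on each small residue.

```python
# Minimal RNS sketch: an integer is stored as residues modulo
# pairwise-coprime moduli; arithmetic is done per modulus on small
# values, and the Chinese Remainder Theorem reconstructs the result.

MODULI = (7, 8, 9)        # pairwise coprime; dynamic range 7*8*9 = 504

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # No carries between channels: each multiply is small and local.
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    # CRT reconstruction using 3-argument pow for modular inverses.
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        x += ri * Mi * pow(Mi, -1, mi)
    return x % M

a, b = to_rns(25), to_rns(13)
print(from_rns(rns_mul(a, b)))   # 325, i.e. 25 * 13 within the range
```

The power saving the paper targets comes from replacing one wide multiplier with several narrow ones; the reconstruction step is only needed when leaving the RNS domain.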

Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing [article]

Akshay Dua, Yixing Li, Fengbo Ren
2020 arXiv   pre-print
This paper presents Systolic-CNN, an OpenCL-defined, scalable, run-time-flexible FPGA accelerator architecture optimized for accelerating the inference of various convolutional neural networks (CNNs) in  ...  Systolic-CNN adopts a highly pipelined and parallelized 1-D systolic array architecture, which efficiently explores both spatial and temporal parallelism for accelerating CNN inference on FPGAs.  ...  Since the current Systolic-CNN architecture is compatible with Winograd-based convolutions, we would like to explore adding support for Winograd-based CNN models to further improve its inference latency  ... 
arXiv:2012.03177v1 fatcat:h5alzshjybhv7kmpmeb46an3qm

Real-Time Super-Resolution System of 4K-Video Based on Deep Learning [article]

Yanpeng Cao, Chengcheng Wang, Changjun Song, Yongming Tang, He Li
2021 arXiv   pre-print
This paper explores the possibility of a real-time VSR system and designs an efficient and generic VSR network, termed EGVSR.  ...  The proposed EGVSR is based on spatio-temporal adversarial learning for temporal coherence.  ...  [Table: synthesis results on FPGA, comparing LUT-based direct convolution (method of 2019 [26]), DSP-based direct convolution (method of 2017 [27]), and the authors' LUT-based Winograd convolution (WinoConv).]  ... 
arXiv:2107.05307v2 fatcat:5fxhco3mtfgwzoukazgwvpfhxu

Recent Advances in Convolutional Neural Network Acceleration [article]

Qianru Zhang, Meng Zhang, Tinghuan Chen, Zhifei Sun, Yuzhe Ma, Bei Yu
2018 arXiv   pre-print
We also analyze the acceleration methods in terms of CNN architecture compression, algorithm optimization, and hardware-based improvement.  ...  Two of the feature properties, local connectivity and weight sharing, can reduce the number of parameters and increase processing speed during training and inference.  ...  Feed-forward Efficient Convolution: Three methods are summarized for feed-forward efficient convolution, including the im2col-based algorithm, the Winograd-based method, and the FFT-based method, with the most commonly  ... 
arXiv:1807.08596v1 fatcat:jx66ekaofjhqzdbaueal476bvi
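Of the three methods this survey names, im2col is the simplest to sketch: each receptive field is unrolled into a row so the convolution becomes one matrix multiplication. The sketch below is our own single-channel, stride-1, no-padding illustration, not the survey's code:

```python
import numpy as np

def im2col_conv2d(image, kernel):
    """Lower a 2-D convolution (here, cross-correlation) to a GEMM by
    unrolling every r x s patch of the image into one row."""
    H, W = image.shape
    r, s = kernel.shape
    out_h, out_w = H - r + 1, W - s + 1
    # Each row of `cols` is one flattened r x s patch.
    cols = np.empty((out_h * out_w, r * s))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + r, j:j + s].ravel()
    # One matrix product computes all outputs at once; with multiple
    # filters, kernel.ravel() becomes a (r*s, num_filters) matrix.
    return (cols @ kernel.ravel()).reshape(out_h, out_w)

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(im2col_conv2d(img, k))   # each output is the sum of a 3x3 patch
```

The memory blow-up of the duplicated patches is the usual cost of im2col, which is why the survey contrasts it with the Winograd and FFT approaches.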
Showing results 1 — 15 out of 190 results