Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization [article]

H. T. Kung and Bradley McDanel and Sai Qian Zhang
2018 arXiv   pre-print
This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations.  ...  By combining subsets of columns in the original filter matrix associated with a convolutional layer, we increase the utilization efficiency of the systolic array substantially (e.g., ~4x) due to the increased  ...  The training process works in an iterative fashion, where at each iteration the model is pruned and packed so that it fits more efficiently in the systolic array.  ... 
arXiv:1811.04770v1 fatcat:khgfhilcx5h2xmgos6qlgdlgzi

SaARSP: An Architecture for Systolic-Array Acceleration of Recurrent Spiking Neural Networks

Jeong-Jun Lee, Wenrui Zhang, Yuan Xie, Peng Li
2022 ACM Journal on Emerging Technologies in Computing Systems  
The proposed systolic-array architecture offers a unifying solution to an acceleration of both feedforward and recurrent SNNs, and delivers 4,000X EDP improvement on average for different R-SNN benchmarks  ...  We present the first work to exploit spatiotemporal parallelisms to accelerate the R-SNN based inference on systolic arrays using an architecture called SaARSP.  ...  Operate the array accelerator with a chosen time window size TW for K array processing iterations. 4.1.1 Systolic array.  ... 
doi:10.1145/3510854 fatcat:w6gytasmsbau3gvtkrhempy6s4

WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs [article]

Xinheng Liu, Yao Chen, Cong Hao, Ashutosh Dhar, Deming Chen
2021 arXiv   pre-print
Using the proposed WinoPE, we construct a highly efficient systolic array accelerator, termed WinoCNN. We also propose a dedicated memory subsystem to optimize the data access.  ...  The combination of Winograd's algorithm and systolic array architecture has demonstrated the capability of improving DSP efficiency in accelerating convolutional neural networks (CNNs) on FPGA platforms  ...  Systolic array architectures are efficient for parallel computing and are widely adopted by FPGA accelerators for matrix multiplications and convolutions [5] [18] .  ... 
arXiv:2107.04244v1 fatcat:ktwzog53yvbfjhywvysguqqcca

FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training [article]

Sangkug Lym, Mattan Erez
2020 arXiv   pre-print
FlexSA dynamically reconfigures the systolic array structure and offers multiple sub-systolic operating modes, which are designed for energy- and memory bandwidth-efficient processing of tensors with different  ...  To make a systolic array efficient for pruning and training, we propose FlexSA, a flexible systolic array architecture.  ...  For efficient acceleration of these GEMMs, many modern training accelerators adopt large systolic array cores that are (typically) a two-dimensional mesh of many simple and efficient processing elements  ... 
arXiv:2004.13027v1 fatcat:6q5aiindzbebzbwixeer4nn7ie

Compiler generated systolic arrays for wavefront algorithm acceleration on FPGAs

Betul Buyukkurt, Walid A. Najj
2008 2008 International Conference on Field Programmable Logic and Applications  
These algorithms are highly computationally intensive and are therefore excellent candidates for FPGA-based code acceleration.  ...  In this paper we describe the transformations performed by ROCCC, which transformed the kernel of the Smith-Waterman algorithm into a hardware systolic array that is mapped onto the FPGA on the SGI Altix  ...  Systolic Array Generation in ROCCC The outer loop in the C code shown in Figure 1 is unrolled to form the processing elements of the systolic array cells.  ... 
doi:10.1109/fpl.2008.4630032 dblp:conf/fpl/BuyukkurtN08 fatcat:qas2izyfljcn7kfv2oa5z6hwia

A Scalable Systolic Accelerator for Estimation of the Spectral Correlation Density Function and its FPGA Implementation

Xiangwei Li, Douglas L. Maskell, Carol Jingyi Li, Philip H.W. Leong, David Boland
2022 ACM Transactions on Reconfigurable Technology and Systems  
The implementation uses a linear systolic array with a bi-directional datapath consisting of DSP-based processing elements (PEs) with a dedicated instruction schedule, achieving a PE utilization of 88.2%  ...  In this paper, we present an efficient FPGA implementation of the FFT accumulation method (FAM) for estimating the SCD function and its alpha profile.  ...  In Section 4, a bi-directional linear systolic array as an FPGA accelerator for FAM is proposed and its mapping scheme is thoroughly explained.  ... 
doi:10.1145/3546181 fatcat:erdazdw245ggljbqrylb27kvgy

Mini-batch Serialization: CNN Training with Inter-layer Data Reuse [article]

Sangkug Lym, Armand Behroozi, Wei Wen, Ge Li, Yongkee Kwon, Mattan Erez
2019 arXiv   pre-print
Combined, WaveCore and MBS reduce DRAM traffic by 75%, improve performance by 53%, and save 26% system energy for modern deep CNN training compared to conventional training mechanisms and accelerators.  ...  We introduce the MBS CNN training approach that significantly reduces memory traffic by partially serializing mini-batch processing across groups of layers.  ...  Second, we modify a traditional systolic array processing core, as used by some commercial accelerators (Jouppi et al., 2017; Lu et al., 2017) , to better execute the tall and skinny GEMMs needed for  ... 
arXiv:1810.00307v4 fatcat:dtquutnsjfgtza6rpyjjcdnmva

Architecture and design of a hardware accelerator for efficient 3D object recognition using the LC method

Donald L. Hung, Karl Hillesland, Jun Wang
2001 Information Sciences  
To address this issue, we propose a hardware accelerator for solving sets of linear equations based on iterative methods.  ...  To make the LC method suitable for real-time 3D object recognition, the key issue is to expedite the learning process by reducing the time consumed in solving the simultaneous linear equations.  ...  Fig. 3. Dataflow diagram of the accelerator with N n architecture (N n 3 as shown). Fig. 4. Datapath of a PE for the systolic array with N n architecture. D.L.  ... 
doi:10.1016/s0020-0255(00)00078-5 fatcat:afz5luroljfu5ajy5rc55fdgnm

CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse [article]

Alberto Marchisio, Muhammad Abdullah Hanif, Muhammad Shafique
2018 arXiv   pre-print
Our architecture exploits the massive parallelism by flexibly feeding the data to a specialized systolic array according to the operations required in different layers.  ...  State-of-the-art convolutional DNN accelerators would not work efficiently for CapsuleNets, as their designs do not account for key operations involved in CapsuleNets, like squashing and dynamic routing  ...  Fig. 11: Architecture of Different Components of our CapsAcc Accelerator: (a) Systolic Array. (b) A Processing Element of the Systolic Array. (c) Accumulator. (d) Activation Unit.  ... 
arXiv:1811.08932v1 fatcat:72ctodtjwrca7ngmppf3rqttoy

HLS Tools for FPGA: Faster Development with Better Performance [chapter]

Alexandre Cornu, Steven Derrien, Dominique Lavenier
2011 Lecture Notes in Computer Science  
Designing FPGA-based accelerators is a difficult and time-consuming task which can be softened by the emergence of new generations of High Level Synthesis Tools.  ...  This paper describes how the ImpulseC C-to-hardware compiler tool has been used to develop efficient hardware for a known genomic sequence alignment algorithm and reports HLL designs performance outperforming  ...  Acknowledgment We would like to thank Impulse Accelerated Technologies for their valuable help. This work has been supported by the French ANR BioWIC (ANR-08-SEGI-005).  ... 
doi:10.1007/978-3-642-19475-7_8 fatcat:7q7rfwffjzekzm466vdgsvxdvu

GPU-Accelerated BWA-MEM Genomic Mapping Algorithm Using Adaptive Load Balancing [chapter]

Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, Zaid Al-Ars
2016 Lecture Notes in Computer Science  
This paper discusses acceleration of the Seed Extension function on a GPU accelerator.  ...  Hence, acceleration of the algorithms used is of utmost importance.  ...  The authors would like to thank the people at the Neuroscience Department of the Erasmus Medical Center for kindly granting access to their computing facilities for performance tests.  ... 
doi:10.1007/978-3-319-30695-7_10 fatcat:jkvry5jqtrgzzcxe4uwkwe3qyq

TensorLib: A Spatial Accelerator Generation Framework for Tensor Algebra [article]

Liancheng Jia, Zizhang Luo, Liqiang Lu, Yun Liang
2021 arXiv   pre-print
In this paper, we propose TensorLib, a framework for generating spatial hardware accelerator for tensor algebra applications.  ...  Our generation framework can select the needed hardware modules for each dataflow, connect the modules using a specified interconnection pattern, and automatically generate the complete hardware accelerator  ...  As shown in Figure 1 , spatial accelerators usually consist of an array of homogeneous processing elements (PEs), an on-chip network that connects PEs together,  ... 
arXiv:2104.12339v1 fatcat:feub4nzzezerhkwnwlhbkr5ol4

Sparse Winograd Convolutional neural networks on small-scale systolic arrays [article]

Feng Shi, Haochen Li, Yuhe Gao, Benjamin Kuschner, Song-Chun Zhu
2018 arXiv   pre-print
In this paper, we implement an accelerator on FPGA by combining the sparse Winograd convolution, clusters of small-scale systolic arrays, and a tailored memory layout design.  ...  The reconfigurability, energy-efficiency, and massive parallelism on FPGAs make them one of the best choices for implementing efficient deep learning accelerators.  ...  subtraction, and "0" for passing by the data to next processing element (PE) inside its systolic array.  ... 
arXiv:1810.01973v1 fatcat:g44643olkvh23fkdp5lsraipe4

Row-Wise Product-Based Sparse Matrix Multiplication Hardware Accelerator With Optimal Load Balancing

Jong Hun Lee, Beomjin Park, Joonho Kong, Arslan Munir
2022 IEEE Access  
According to our evaluation, our 32PE-SpMM accelerator shows 13.6×–47.9× speedup over tensor processing unit (TPU)-like systolic arrays, on average.  ...  Though systolic arrays have shown a significant performance and energy efficiency improvement over central processing units (CPUs) or graphic processing units (GPUs) when executing matrix multiplications  ...  • We will quantitatively evaluate our accelerator and load balancing scheme with the state-of-the-art accelerators; • We will further investigate the impact of non-zero data distribution patterns of the  ... 
doi:10.1109/access.2022.3184116 fatcat:bo2mnnyvxvfyzju2zi7ex5qvzu

Page 4798 of Mathematical Reviews Vol. , Issue 86j [page]

1986 Mathematical Reviews  
A systolic algorithm has many systolic cells which can operate in parallel, and each of them is implemented by a processing element (PE) in a systolic architecture.  ...  in adaptive signal processing.  ... 