25 Hits in 8.1 sec

Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference [article]

Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina
2020 arXiv   pre-print
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). ... The systolic array (SA) is a pipelined 2D array of processing elements (PEs), with very efficient local data movement, well suited to accelerating GEMM, and widely deployed in industry. ... The systolic array (SA) is a special-purpose processor for efficiently accelerating GEMM. ...
arXiv:2005.08098v1 fatcat:5ku6lcly7bbdnfdtlk3vmpix5m
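The systolic array described in the snippet above can be illustrated with a tiny simulation. The sketch below models the general output-stationary SA idea (operands arrive at each PE one skewed wavefront per cycle); it is an illustration only, not the authors' Systolic Tensor Array, and the function name and skewing scheme are my own:

```python
def systolic_gemm(A, B):
    """Compute C = A @ B on a simulated n x m grid of PEs.

    Each PE (i, j) holds one accumulator. With input skewing, PE (i, j)
    sees A[i][s] (streamed from the left) and B[s][j] (streamed from the
    top) at cycle t = s + i + j; iterating over cycles models the
    pipelined wavefront of data moving through the array.
    """
    n, k = len(A), len(A[0])
    m = len(B[0])
    acc = [[0] * m for _ in range(n)]
    for t in range(n + m + k):          # enough cycles to drain the pipeline
        for i in range(n):
            for j in range(m):
                s = t - i - j           # which operand pair reaches PE (i, j) now
                if 0 <= s < k:
                    acc[i][j] += A[i][s] * B[s][j]
    return acc
```

Note that the triple loop computes exactly a dense GEMM; the cycle/skew bookkeeping is what distinguishes the systolic schedule from a naive matrix multiply.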

Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [article]

Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina
2020 arXiv   pre-print
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). ... Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. ... For mobile computing devices, INT8 CNN inference accelerators demand high energy efficiency (TOPS/W) and area efficiency (TOPS/mm²) to achieve performance and price ...
arXiv:2009.02381v2 fatcat:36cpt7z6vfgsrhh46dli6rnkuq
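Structural sparsity of the kind this snippet mentions is commonly expressed as an N:M pattern (e.g. keep at most 2 nonzeros in every block of 4), which is what yields predictable load balancing and low index overhead. A minimal sketch, assuming a simple 2:4 magnitude-based pruning rule; the paper's density-bound block scheme may differ:

```python
def prune_2_of_4(weights):
    """Zero all but the 2 largest-magnitude values in each block of 4.

    Illustrative 2:4 structured pruning: every block carries exactly the
    same worst-case nonzero count, so hardware lanes stay balanced and
    only a tiny per-block index is needed to locate the survivors.
    """
    out = []
    for b in range(0, len(weights), 4):
        block = weights[b:b + 4]
        # indices of the two largest-magnitude entries in this block
        keep = set(sorted(range(len(block)),
                          key=lambda i: abs(block[i]),
                          reverse=True)[:2])
        out.extend(v if i in keep else 0 for i, v in enumerate(block))
    return out
```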

McDRAM v2: In-Dynamic Random Access Memory Systolic Array Accelerator to Address the Large Model Problem in Deep Neural Networks on the Edge

Seunghwan Cho, Haerang Choi, Eunhyeok Park, Hyunsung Shin, Sungjoo Yoo
2020 IEEE Access  
presented an architectural solution for the adoption of a 2D systolic array structure inside the DRAM cell die for large DNNs.  ...  The mobile 8-Core ARM v8.2 64-Bit CPU is equipped with an 8 MB L2 cache and 4 MB L3 cache, and the mobile 512-core Volta GPU includes the Tensor Cores and deep learning accelerator (DLA) for integer GEMM  ... 
doi:10.1109/access.2020.3011265 fatcat:hmfnggvh7nbglef3dkzkuxjloa

S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration [article]

Zhi-Gang Liu, Paul N. Whatmough, Yuhao Zhu, Matthew Mattina
2022 arXiv   pre-print
Prior sparse CNN accelerators largely exploit unstructured sparsity and achieve significant speedups. ... Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices. ... The additional buffering structures significantly increase the energy and area overhead. Fig. 1 shows the energy breakdown of an INT8 dense systolic array accelerator for a typical CNN layer. ...
arXiv:2107.07983v2 fatcat:oghwyenbmffflikfvusd4xm2pu

SPOTS: An Accelerator for Sparse Convolutional Networks Leveraging Systolic General Matrix-Matrix Multiplication [article]

Mohammadreza Soltaniyeh, Richard P. Martin, Santosh Nagarakatte
2021 arXiv   pre-print
We propose a tall systolic array for the GEMM unit while also providing the ability to organize it as multiple small GEMM units, which enables our design to handle a wide range of CNNs and their parameters. ... This paper proposes a new hardware accelerator for sparse convolutional neural networks (CNNs) by building a hardware unit to perform the Image to Column (IM2COL) transformation of the input feature map. ... Hence, this paper focuses on designing an efficient hardware accelerator for the CNN inference task targeting edge devices. Accelerating convolutional neural networks. ...
arXiv:2107.13386v2 fatcat:k7oampka5rdztojmmwrr2yvnfm
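The IM2COL transformation the abstract refers to lowers convolution to GEMM by flattening each sliding window of the input feature map into one matrix row, so the convolution becomes a single matrix product with the flattened filters. A minimal single-channel sketch (the function name and row layout are illustrative, not SPOTS's hardware unit):

```python
def im2col(x, kh, kw, stride=1):
    """Flatten each kh x kw sliding window of a 2D input into one row.

    The result has shape (out_h * out_w, kh * kw); multiplying it by a
    (kh * kw, num_filters) weight matrix performs the convolution as GEMM.
    """
    h, w = len(x), len(x[0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = []
    for r in range(out_h):
        for c in range(out_w):
            i, j = r * stride, c * stride
            # row-major flattening of the window anchored at (i, j)
            cols.append([x[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return cols
```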

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity [article]

Cong Guo and Bo Yang Hsueh and Jingwen Leng and Yuxian Qiu and Yue Guan and Zehuan Wang and Xiaoying Jia and Xipeng Li and Minyi Guo and Yuhao Zhu
2020 arXiv   pre-print
We propose a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to  ...  Consequently, sparse models cannot achieve meaningful speedup on commodity hardware (e.g., GPU) built for dense matrix computations.  ...  ACKNOWLEDGEMENT We thank the anonymous reviewers for their constructive feedback for improving the work.  ... 
arXiv:2008.13006v1 fatcat:5r3luayqivaafbt5daho4rulmi
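The "tile-wise" pattern described above can be sketched as pruning at tile granularity: within each column tile, whole columns are dropped by magnitude, so every tile remains dense and regular for the GEMM kernel while different tiles may prune different columns, giving irregularity at the global scale. A hypothetical illustration (the parameters and magnitude rule are my own, not the paper's exact scheme):

```python
def tile_wise_prune(matrix, tile_cols=4, keep_ratio=0.5):
    """Within each tile of tile_cols columns, zero the columns with the
    smallest L1 norm, keeping roughly keep_ratio of them.

    Each tile stays a regular dense block (GPU-friendly), while the set
    of surviving columns varies tile to tile (globally irregular).
    """
    rows, cols = len(matrix), len(matrix[0])
    out = [row[:] for row in matrix]
    for t0 in range(0, cols, tile_cols):
        tile = range(t0, min(t0 + tile_cols, cols))
        # per-column L1 norm as the pruning saliency
        norms = {c: sum(abs(matrix[r][c]) for r in range(rows)) for c in tile}
        keep = max(1, int(len(norms) * keep_ratio))
        kept = set(sorted(norms, key=norms.get, reverse=True)[:keep])
        for c in tile:
            if c not in kept:
                for r in range(rows):
                    out[r][c] = 0
    return out
```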

Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights [article]

Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, Baoxin Li
2021 arXiv   pre-print
structured sparsity can improve storage efficiency and balance computations; understanding how to compile and map models with sparse tensors on the accelerators; understanding recent design trends for  ...  This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators.  ...  of sparse tensors: Designers may opt for accelerators that are effective for structured computations of dense tensors, e.g., systolic arrays (as near-data accelerators or coupled to processor cores) and  ... 
arXiv:2007.00864v2 fatcat:k4o2xboh4vbudadfiriiwjp7uu

Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors [article]

Lukas Baischer, Matthias Wess, Nima TaheriNejad
2021 arXiv   pre-print
In particular, we focus on acceleration of the inference of convolutional neural networks (CNNs) used for image recognition tasks. Given that there exist many different hardware architectures ... For this reason, optimized hardware accelerators are used to increase the performance of the inference of neural networks. ... Typical hardware architectures used in an FPGA-based hardware accelerator. (a) Typical structure of a systolic array. PEs can communicate with adjacent PEs. ...
arXiv:2104.09252v1 fatcat:625wtuskhff3lbswhwmj7decni

2020 Index IEEE Computer Architecture Letters Vol. 19

2021 IEEE computer architecture letters  
... –June 2020, 63–67. Systolic arrays: Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference. Liu, Z., +, LCA Jan. ... , +, LCA July–Dec. 2020, 147–150. Inference mechanisms: GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers. ...
doi:10.1109/lca.2020.3048555 fatcat:rpa2p25anzftjljygpm67ytioe

FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support [article]

Seock-Hwan Noh, Jahyun Koo, Seunghyun Lee, Jongse Park, Jaeha Kung
2022 arXiv   pre-print
While several prior works proposed such multi-precision support for DNN accelerators, not only do they focus solely on inference, but their core utilization is also suboptimal at a fixed precision and ... , and gradient tensors. ... SIGMA [48] proposes a training accelerator that handles both sparsity and irregular structure in GEMM operations by using a Benes network for efficient workload distribution. ...
arXiv:2203.06673v1 fatcat:xzsduig2mndbxitohkvq67b374
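Block floating point (BFP), the number format FlexBlock builds on, stores one shared exponent per block plus a small integer mantissa per element, so block-level dot products reduce to cheap integer arithmetic. A minimal encode/decode sketch of the general BFP idea (not FlexBlock's multi-mode datapath; function and parameter names are assumptions):

```python
import math

def to_bfp(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent + integer mantissas.

    The shared exponent is chosen from the largest magnitude in the block;
    every element is then scaled and rounded to a signed integer mantissa.
    """
    max_abs = max(abs(v) for v in block) or 1.0
    shared_exp = math.floor(math.log2(max_abs))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(1 << (mantissa_bits - 1)), (1 << (mantissa_bits - 1)) - 1
    mants = [max(lo, min(hi, round(v / scale))) for v in block]
    return shared_exp, mants

def from_bfp(shared_exp, mants, mantissa_bits=8):
    """Reconstruct the (lossy) float values from a BFP block."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mants]
```

The quantization error is largest for the block's small-magnitude elements, which is why block size and mantissa width are the key tuning knobs in BFP training accelerators.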

An Efficient Hardware Design for Accelerating Sparse CNNs with NAS-based Models

Yun Liang, Liqiang Lu, Yicheng Jin, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin
2021 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  
In this work, we propose an accelerator with software-hardware co-design for sparse CNNs on FPGAs. ... On the other hand, FPGAs have been demonstrated to be an effective hardware platform to accelerate CNN inference. ... Moreover, we design an efficient accelerator for sparse CNNs that features a tile look-up table (TLUT) and a channel multiplexer (CMUX). ...
doi:10.1109/tcad.2021.3066563 fatcat:vxqd4ez64zgxxcwuy5uq2txmpy

An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators

Seyed Morteza Nabavinejad, Mohammad Baharloo, Kun-Chih Chen, Maurizio Palesi, Tim Kogel, Masoumeh Ebrahimi
2020 IEEE Journal on Emerging and Selected Topics in Circuits and Systems  
As a result, efficient interconnection and data movement mechanisms for future on-chip artificial intelligence (AI) accelerators are worthy of study. ... Currently, a large body of research aims to find an efficient on-chip interconnection to achieve low-power and high-bandwidth DNN computing. ... SIGMA outperforms cutting-edge sparse accelerators by 3x and performs better than systolic array architectures by 5.7x for irregular sparse matrices. III. ...
doi:10.1109/jetcas.2020.3022920 fatcat:idqitgwnrnegbd4dhrly3xsxbi

Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique
2020 IEEE Access  
prominence to the last two solutions since they offer greater design flexibility and bear the potential of high energy-efficiency, especially for the inference process.  ...  In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need for cost-effective hardware platforms capable of implementing energy-efficient DL  ...  It can be realized in hardware with a 2D systolic array.  ... 
doi:10.1109/access.2020.3039858 fatcat:nticzqgrznftrcji4krhyjxudu

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs [article]

Petar Jokic, Erfan Azarkhish, Andrea Bonetti, Marc Pons, Stephane Emery, Luca Benini
2021 arXiv   pre-print
Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers with an overview of design choices for implementing efficient low power neural network accelerators. ... This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. ... Replacing single-PE systolic arrays with tensor-PEs, each one computing an entire matrix multiplication per cycle, further allows reducing area and power in a 16nm process by 2.1x and 1.4x, respectively. ...
arXiv:2106.12810v1 fatcat:gx7cspazc5fdfoi64t2zjth7am

A Survey of FPGA-Based Neural Network Accelerator [article]

Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, Huazhong Yang
2018 arXiv   pre-print
In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. ... An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future ... [67] use the systolic array structure in their design. The shared data are transferred from one computation unit to the next in a chain mode. ...
arXiv:1712.08934v3 fatcat:vbrf3s27e5gdtcr7uzg3smbpli
Showing results 1–15 of 25