
A Streaming Dataflow Engine for Sparse Matrix-Vector Multiplication using High-Level Synthesis

Mohammad Hosseinabady, Jose Luis Nunez-Yanez
2019 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  
Using high-level synthesis techniques, this paper proposes an adaptable high-performance streaming dataflow engine for sparse matrix dense vector multiplication (SpMV) suitable for embedded FPGAs.  ...  As SpMV is a memory-bound algorithm, this engine combines the three concepts of loop pipelining, dataflow graphs, and data streaming to utilize most of the memory bandwidth available to the FPGA.  ...  ACKNOWLEDGMENT The authors would like to thank the support received from EPSRC for this work, part of the ENEAC project (EP/N002539/1). The open source code of this research can be found at [21].  ...
doi:10.1109/tcad.2019.2912923 fatcat:ljhbd4gedjcuxiayxh3nqkty3i
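To see why SpMV is typically memory-bound, as the snippet above notes, a back-of-envelope roofline estimate is enough: each nonzero contributes two flops but requires loading a matrix value plus a column index. A minimal sketch (the bandwidth and peak-compute figures below are hypothetical, not taken from the paper):

```python
# Illustrative roofline estimate of why SpMV is memory-bound. The platform
# figures used below (10 GB/s, 50 GFLOP/s) are hypothetical, not from the paper.

def spmv_roofline(nnz, bandwidth_gbs, peak_gflops):
    """Return (estimated time in s, limiting resource) for a CSR-like SpMV."""
    bytes_per_nnz = 8 + 4   # one float64 value + one int32 column index
    flops_per_nnz = 2       # one multiply + one add per nonzero
    t_mem = nnz * bytes_per_nnz / (bandwidth_gbs * 1e9)
    t_cmp = nnz * flops_per_nnz / (peak_gflops * 1e9)
    return max(t_mem, t_cmp), ("memory" if t_mem >= t_cmp else "compute")

t, bound = spmv_roofline(nnz=10_000_000, bandwidth_gbs=10, peak_gflops=50)
```

Under these assumed numbers the memory term dominates by roughly 30x, which is why an engine like the one above focuses on saturating memory bandwidth via streaming rather than on raw compute.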

Dataflow Matrix Machines as a Model of Computations with Linear Streams [article]

Michael Bukatin, Jon Anthony
2017 arXiv   pre-print
We describe a vector space of finite prefix trees with numerical leaves which allows us to combine the expressive power of dataflow matrix machines with the simplicity of traditional recurrent neural networks.  ...  We overview dataflow matrix machines as a Turing-complete generalization of recurrent neural networks and as a programming platform.  ...  Dataflow Matrix Machines Based on the Vector Space Generated by Finite Strings The powerful setup described above involves a relatively high level of design complexity.  ...
arXiv:1706.00648v1 fatcat:trqnzvdifba3pceyqiyjuyzsvu

Dataflow Matrix Machines and V-values: a Bridge between Programs and Neural Nets [article]

Michael Bukatin, Jon Anthony
2018 arXiv   pre-print
1) Dataflow matrix machines (DMMs) generalize neural nets by replacing streams of numbers with linear streams (streams supporting linear combinations), allowing arbitrary input and output arities for activation functions, countable-sized networks with a finite, dynamically changeable active part capable of unbounded growth, and a very expressive self-referential mechanism. 2) DMMs are suitable for general-purpose  ...  Acknowledgments We would like to thank Dima-David Datjko, Elena Machkasova, and Elena Nekludova for helpful discussions of the material presented in [2] and here.  ...
arXiv:1712.07447v2 fatcat:wsdoak2w3baspcb6ingtwqctcq
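The "linear streams" in the snippet above are streams that support pointwise linear combinations. A minimal toy sketch in Python (illustrative only; the papers' actual construction works over richer vector spaces such as finite prefix trees with numerical leaves):

```python
# Toy model of "linear streams": streams of numbers closed under pointwise
# linear combinations. Illustrative only; DMMs in the papers operate on
# richer linear streams (e.g. finite prefix trees with numerical leaves).
from itertools import count, islice

def linear_combination(coeffs, streams):
    """Yield the pointwise linear combination sum_i coeffs[i] * streams[i]."""
    for values in zip(*streams):
        yield sum(c * v for c, v in zip(coeffs, values))

ones = (1.0 for _ in count())        # stream 1, 1, 1, ...
nats = (float(n) for n in count())   # stream 0, 1, 2, ...
combo = linear_combination([2.0, 3.0], [ones, nats])
first_four = list(islice(combo, 4))  # 2*1 + 3*n for n = 0, 1, 2, 3
```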

Sparse and dense matrix multiplication hardware for heterogeneous multi-precision neural networks

Jose Nunez-Yanez, Mohammad Hosseinabady
2021 Array  
In this paper, we present hardware accelerators created with high-level synthesis techniques for sparse and dense matrix multiplication operations.  ...  Overall, the balance between sparse and dense performance depends on matrix shape, precision, structural pruning, and sparsity levels, and performance modelling can be used to balance concurrent execution  ...  A FINN engine implements the matrix-vector products of fully-connected layers or the matrix-matrix products of convolution operations.  ...
doi:10.1016/j.array.2021.100101 fatcat:nr32njqu4bawdc2dihiqi6mv2e

Capstan: A Vector RDA for Sparsity [article]

Alexander Rucker, Matthew Vilim, Tian Zhao, Yaqi Zhang, Raghu Prabhakar, Kunle Olukotun
2021 arXiv   pre-print
Using a declarative programming model, Capstan supports application-independent sparse iteration and memory primitives that can be mapped to vectorized, high-performance hardware.  ...  This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable-dataflow accelerator (RDA) for sparse and dense tensor applications.  ...  Finally, the top-level vector and second-level vectors are processed by nested sparse-sparse scanners. We use a bit-tree implementation for matrix-matrix add. Memory Ordering.  ... 
arXiv:2104.12760v1 fatcat:k7s6dsgikvgixcip2xyrd7eriu

Morphling: A Reconfigurable Architecture for Tensor Computation

Liqiang Lu, Yun Liang
2021 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  
Furthermore, to efficiently support sparse tensors, we design a tiled-BCSR format that enables high parallelism and a balanced workload.  ...  The dense and sparse tensor computations share the same execution model, but differ in the vector computation step where the multiplications are conducted.  ...  KRP: Khatri-Rao product, used in tensor factorization. SpMM: sparse-sparse matrix multiplication, used in databases and deep learning.  ...
doi:10.1109/tcad.2021.3135322 fatcat:5omvjoxy3zd7jgaear5swwjlou
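The Khatri-Rao product (KRP) named in the snippet above is the column-wise Kronecker product of two matrices with the same number of columns, and is the core kernel of CP tensor factorization. A minimal pure-Python sketch (illustrative, not Morphling's implementation):

```python
def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x K) and B (J x K), as lists of rows.

    Column k of the (I*J) x K result is kron(column k of A, column k of B).
    """
    K = len(A[0])
    assert all(len(row) == K for row in A) and all(len(row) == K for row in B)
    # Row i*J + j of the result is the elementwise product of A row i, B row j.
    return [[a_row[k] * b_row[k] for k in range(K)]
            for a_row in A for b_row in B]

# Example: two 2x2 factor matrices give a 4x2 Khatri-Rao product.
krp = khatri_rao([[1, 2], [3, 4]], [[0, 1], [1, 0]])
```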

Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures [article]

Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
2020 arXiv   pre-print
These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface.  ...  We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization.  ...  Using explicit dataflow is beneficial when defining nontrivial data accesses. Fig. 4 depicts a full implementation of Sparse Matrix-Vector multiplication (SpMV).  ... 
arXiv:1902.10345v3 fatcat:4aerjkgf2fguhlbcbrw7g2uw5e


Subhankar Pal, Siying Feng, Dong-hyeon Park, Sung Kim, Aporva Amarnath, Chi-Sheng Yang, Xin He, Jonathan Beaumont, Kyle May, Yan Xiong, Kuba Kaszyk, John Magnus Morton (+8 others)
2020 Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques  
Finally, in order to support programmability and ease of adoption, we prototype a software stack composed of low-level runtime routines and a high-level language library called TransPy that cater to  ...  This is particularly true for domains that have frequently changing algorithms and applications involving mixed sparse/dense data structures, such as those in machine learning and graph analytics.  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers for their helpful feedback.  ...
doi:10.1145/3410463.3414627 dblp:conf/IEEEpact/PalFPKAYHBMXKMS20 fatcat:kwsaun2g65b6jl6mdqrhgiv7yq

High Level Synthesis with a Dataflow Architectural Template [article]

Shaoyi Cheng, John Wawrzynek
2016 arXiv   pre-print
In this work, we present a new approach to high level synthesis (HLS), where high level functions are first mapped to an architectural template before hardware synthesis is performed.  ...  As FPGA platforms are especially suitable for implementing streaming processing pipelines, we perform transformations on conventional high level programs where they are turned into multi-stage dataflow  ...  For sparse matrix vector (SpMV) multiply, our first kernel, the compressed sparse row (CSR) format is used to store the matrix, where the loads of the floating point numbers to be multiplied depend on the data  ...
arXiv:1606.06451v1 fatcat:zfm7tw5rk5hxrcluxha2gigf3u
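The CSR layout mentioned in the snippet stores a matrix as three arrays: the nonzero values, their column indices, and per-row offsets. A minimal sketch of CSR SpMV (illustrative, not the paper's hardware pipeline) shows the data-dependent loads the snippet refers to:

```python
# CSR stores A as (values, col_idx, row_ptr); row r owns the slice
# values[row_ptr[r]:row_ptr[r+1]]. The load x[col_idx[k]] is data-dependent,
# which is the irregular access pattern the snippet refers to.

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for A in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, data-dependent load
        y.append(acc)
    return y

# 3x3 example matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
```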

A TensorFlow Extension Framework for Optimized Generation of Hardware CNN Inference Engines

Vasileios Leon, Spyridon Mouselinos, Konstantina Koliogeorgi, Sotirios Xydis, Dimitrios Soudris, Kiamal Pekmestzi
2020 Technologies  
The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as Field-Programmable Gate Arrays (FPGAs), while their  ...  Towards this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines  ...  As a last level of optimization, we employ the dataflow optimization technique, i.e., the matrix multiplication transformation (MMT).  ...
doi:10.3390/technologies8010006 fatcat:hxbvytd3tvfe7jzhr4hrovwjmq

Dataflow computing with Polymorphic Registers

Catalin Ciobanu, Georgi Gaydadjiev, Christian Pilato, Donatella Sciuto
2013 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)  
This paper shows how PRFs can be integrated in dataflow computational platforms.  ...  Heterogeneous systems are becoming increasingly popular for data processing. They improve performance of simple kernels applied to large amounts of data.  ...  Compared to the Cell processor, PRFs decrease the number of instructions for a customized, high performance dense matrix multiplication by up to 35 times [8] and improve performance for Floyd and sparse  ... 
doi:10.1109/samos.2013.6621140 dblp:conf/samos/CiobanuGPS13 fatcat:sspqvt6m4zdm5neapom5beygza

Towards Design Methodology of Efficient Fast Algorithms for Accelerating Generative Adversarial Networks on FPGAs [article]

Jung-Woo Chang, Saehyun Ahn, Keon-Woo Kang, Suk-Ju Kang
2019 arXiv   pre-print
Firstly, we introduce a new class of fast algorithm for DeConv layers using Winograd minimal filtering.  ...  Secondly, we propose a new dataflow to prevent resource underutilization by reorganizing the filter layout in the Winograd domain.  ...  Through this dataflow, the vector-level sparsity, which exists in the transformed filters, is utilized.  ... 
arXiv:1911.06918v1 fatcat:g2gpdp2ikjgjtbyjr334fyrsla

Programmatic Control of a Compiler for Generating High-performance Spatial Hardware [article]

Hongbo Rong
2017 arXiv   pre-print
Spatial architectures are efficient for executing dataflow algorithms, yet for high-performance programming, the productivity is low and verification is painful.  ...  Consequently, high performance is expected with substantially higher productivity: compared with high-performance programming in today's high-level synthesis (HLS) languages or hardware description languages  ...  For example, for SpMV, a high-performance design [15] preprocesses a sparse matrix on the host; with the preprocessed matrix as input, the workload on the device side becomes much like a regular dense  ... 
arXiv:1711.07606v2 fatcat:edmaiggtcfhkzd74yhca3jakiy

A Survey on System-Level Design of Neural Network Accelerators

Kenshu Seto
2021 Journal of Integrated Circuits and Systems  
In this paper, we present a brief survey on the system-level optimizations used for convolutional neural network (CNN) inference accelerators.  ...  In addition, we discuss streaming architectures and single computation engine architectures that are commonly used in CNN accelerators.  ...  An accelerator generator of a single computation engine using high-level synthesis (HLS) is proposed in [21] .  ... 
doi:10.29292/jics.v16i2.505 fatcat:ibbkeob42jepbguezlptws2qha

Architectural synthesis of computational pipelines with decoupled memory access

Shaoyi Cheng, John Wawrzynek
2014 2014 International Conference on Field-Programmable Technology (FPT)  
As high level synthesis (HLS) moves towards mainstream adoption among FPGA designers, it has proven to be an effective method for rapid hardware generation.  ...  The methodology complements existing work in high-level synthesis, easing the creation of heterogeneous systems with high performance accelerators and general purpose processors.  ...  The ASPIRE Lab is funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA  ... 
doi:10.1109/fpt.2014.7082758 dblp:conf/fpt/ChengW14 fatcat:wdka47i2qjefpjqumez6nkoiyq