
Automating Optimized Table-with-Polynomial Function Evaluation for FPGAs [chapter]

Dong-U Lee, Oskar Mencer, David J. Pearce, Wayne Luk
2004 Lecture Notes in Computer Science  
On the algorithmic side, MATLAB is used to design approximation algorithms with polynomial coefficients and to minimize bitwidths.  ...  Function evaluation is at the core of many compute-intensive applications which perform well on reconfigurable platforms.  ...  POLY has the worst delay, due to computations involving high-degree polynomials, whose number of terms increases with the bitwidth.  ... 
doi:10.1007/978-3-540-30117-2_38 fatcat:z3hx6c7ikbeu3bq7vc5rauyjzq
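The table-with-polynomial approach referred to in this entry splits the input range into segments, stores per-segment polynomial coefficients in a table, and evaluates the selected polynomial at run time. The sketch below illustrates only that general idea; the segment count, degree, and least-squares fitting are assumptions for illustration, not the authors' MATLAB/FPGA flow:

```python
import numpy as np

def build_table(f, degree, segments, lo=0.0, hi=1.0):
    """Fit one degree-`degree` polynomial per uniform segment of [lo, hi)."""
    edges = np.linspace(lo, hi, segments + 1)
    table = []
    for a, b in zip(edges[:-1], edges[1:]):
        xs = np.linspace(a, b, 64)
        # Least-squares fit in the local variable t = x - a (keeps coefficients small).
        coeffs = np.polyfit(xs - a, f(xs), degree)
        table.append((a, coeffs))
    return table, (hi - lo) / segments

def evaluate(x, table, seg_width):
    """Table lookup followed by Horner evaluation of the local polynomial."""
    idx = min(int((x - table[0][0]) / seg_width), len(table) - 1)
    a, coeffs = table[idx]
    t, y = x - a, 0.0
    for c in coeffs:              # Horner's rule, highest-degree coefficient first
        y = y * t + c
    return y

table, w = build_table(np.sin, degree=2, segments=16)
print(evaluate(0.3, table, w), np.sin(0.3))
```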

A Flexible In-Memory Computing Architecture for Heterogeneously Quantized CNNs

Flavio Ponzina, Marco Rios, Giovanni Ansaloni, Alexandre Levisse, David Atienza
2021 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)  
Addressing this challenge, in this paper we present a flexible In-Memory Computing (IMC) architecture and circuit, able to scale data representations to varying bitwidths at run-time, while ensuring high  ...  with respect to fixed-bitwidth alternatives, for negligible accuracy degradation.  ...  They cope with the high memory and computational intensity of CNNs, leveraging the regular computing pattern and the data reuse opportunities of inference.  ... 
doi:10.1109/isvlsi51109.2021.00039 fatcat:sg3gan6f6bas5m7ooinw67aycu
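Scaling data representations to varying bitwidths at run time can be mimicked in plain software with a symmetric uniform quantizer; the snippet below is only a minimal sketch of that idea, not the in-memory computing circuit the paper describes:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization of a float array to `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

w = np.random.randn(256)
for bits in (8, 4, 2):                  # run-time selectable precision
    q, s = quantize(w, bits)
    err = np.abs(w - q * s).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
```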

Efficient Bitwidth Search for Practical Mixed Precision Neural Network [article]

Yuhang Li, Wei Wang, Haoli Bai, Ruihao Gong, Xin Dong, Fengwei Yu
2020 arXiv   pre-print
Meanwhile, it is yet unclear how to perform convolution for weights and activations of different precision efficiently on generic hardware platforms.  ...  However, it is challenging to find the optimal bitwidth (i.e., precision) for weights and activations of each layer efficiently.  ...  Despite their success, low efficiency and computational burden are the major drawbacks of these approaches.  ... 
arXiv:2003.07577v1 fatcat:7gaprm7ypvdmlpejaycnp7wwv4
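As a rough illustration of what a per-layer bitwidth search does, the following greedy sketch lowers the bitwidth of whichever layer degrades a quantization-error proxy the least until a weight-bit budget is met. The proxy and the greedy rule are assumptions for illustration; they are not the search algorithm proposed in the paper:

```python
import numpy as np

def quant_error(w, bits):
    """Proxy for accuracy loss: mean squared quantization error of one layer."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.mean((w - np.round(w / scale) * scale) ** 2)

def greedy_bitwidth_search(layers, budget_bits):
    """Start at 8 bits everywhere, then repeatedly lower the bitwidth of the
    layer whose reduction increases the error proxy the least, until the
    total weight-bit budget is met."""
    bits = {name: 8 for name in layers}

    def total_bits():
        return sum(b * layers[n].size for n, b in bits.items())

    while total_bits() > budget_bits:
        candidates = [(quant_error(layers[n], bits[n] - 1) - quant_error(layers[n], bits[n]), n)
                      for n in layers if bits[n] > 2]
        _, pick = min(candidates)
        bits[pick] -= 1
    return bits

layers = {f"conv{i}": np.random.randn(64, 64) for i in range(4)}
print(greedy_bitwidth_search(layers, budget_bits=4 * 64 * 64 * 5))
```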

Implementation and evaluation of channel estimation and phase tracking for vehicular networks

Yongjiu Du, Dinesh Rajan, Joseph Camp
2013 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC)  
We also provide a detailed hardware design for a decoder-based channel estimation algorithm with a pipeline structure on an FPGA-based platform.  ...  We also jointly evaluate the packet error rate versus the bitwidth of the data in the FPGA, which is important to achieve a good balance between the hardware cost and the system performance.  ...  In order to investigate the performance with diverse bitwidths during the digital signal processing, we emulate the PER performance with different bitwidths for the implementation of the channel estimation  ... 
doi:10.1109/iwcmc.2013.6583738 dblp:conf/iwcmc/DuRC13a fatcat:njpxroelaveivbia7jdezt4b64
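The bitwidth-versus-error-rate trade-off mentioned here can be reproduced in miniature with a quantized BPSK receiver over an AWGN channel. This toy simulation is my own construction, not the paper's OFDM channel-estimation pipeline; it only shows how the error rate degrades as the receiver's sample bitwidth shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def ber_at_bitwidth(bits, snr_db=4.0, n=200_000):
    """BER of hard-decision BPSK over AWGN when the receiver quantizes
    its input samples to `bits`-bit fixed point."""
    tx = rng.choice([-1.0, 1.0], size=n)
    noise = rng.normal(scale=10 ** (-snr_db / 20), size=n)
    rx = tx + noise
    qmax = 2 ** (bits - 1) - 1
    step = 2.0 / qmax                        # full scale roughly [-2, 2]
    rx_q = np.clip(np.round(rx / step), -qmax, qmax) * step
    decisions = np.where(rx_q >= 0, 1.0, -1.0)
    return np.mean(decisions != tx)

for bits in (2, 3, 4, 6, 8):
    print(f"{bits}-bit receiver: BER {ber_at_bitwidth(bits):.4f}")
```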

QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core

Yuke Wang, Boyuan Feng, Yufei Ding
2022 Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming  
In recent years, quantized graph neural networks (QGNNs) have attracted considerable research and industry attention due to their high robustness and low computation and memory overhead.  ...  To this end, we propose the first Tensor Core (TC) based computing framework, QGTC, to support any-bitwidth computation for QGNNs on GPUs.  ...  This inspires us to explore the high-performance GPU hardware features that can efficiently support the QGNN computation.  ... 
doi:10.1145/3503221.3508408 fatcat:7uwousak7vguln7gvgft7aenum
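The any-bitwidth computation mentioned in this entry rests on decomposing low-bit integers into bit planes and composing 1-bit matrix products with shifts; QGTC maps those 1-bit products onto GPU Tensor Cores. The NumPy sketch below shows only the bit-decomposition arithmetic, with unsigned operands assumed for simplicity:

```python
import numpy as np

def any_bitwidth_matmul(A, B, bits_a, bits_b):
    """Compute A @ B for unsigned low-bit integer matrices by decomposing
    each operand into bit planes and summing shifted 1-bit matmuls."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(bits_a):
        a_plane = (A >> i) & 1                    # i-th bit plane of A
        for j in range(bits_b):
            b_plane = (B >> j) & 1                # j-th bit plane of B
            acc += (a_plane @ b_plane).astype(np.int64) << (i + j)
    return acc

rng = np.random.default_rng(0)
A = rng.integers(0, 2 ** 3, size=(4, 8))          # 3-bit operand
B = rng.integers(0, 2 ** 2, size=(8, 5))          # 2-bit operand
assert np.array_equal(any_bitwidth_matmul(A, B, 3, 2), A @ B)
```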

QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core [article]

Yuke Wang and Boyuan Feng and Yufei Ding
2021 arXiv   pre-print
In recent years, quantized graph neural networks (QGNNs) have attracted considerable research and industry attention due to their high robustness and low computation and memory overhead.  ...  To this end, we propose the first Tensor Core (TC) based computing framework, QGTC, to support any-bitwidth computation for QGNNs on GPUs.  ...  This inspires us to explore the high-performance GPU hardware features that can efficiently support the QGNN computation.  ... 
arXiv:2111.09547v5 fatcat:2vw22r3nkjczfe5jaj7leaafp4

NICE: Noise Injection and Clamping Estimation for Neural Network Quantization

Chaim Baskin, Evgenii Zheltonozhkii, Tal Rozen, Natan Liss, Yoav Chai, Eli Schwartz, Raja Giryes, Alexander M. Bronstein, Avi Mendelson
2021 Mathematics  
a GPU, which is not power efficient and therefore does not suit low-power systems such as mobile devices.  ...  This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with as low as 3-bit weights and activations  ...  For low bitwidths, i.e., 3-bit weights and activations, we observe the opposite: the uniform noise assumption is no longer accurate.  ... 
doi:10.3390/math9172144 fatcat:zgrkaxcoazd3vdesjviiv25vca
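Two ingredients named in the title, noise injection and clamping, can be sketched in a few lines: inject uniform noise of half a quantization step into the weights during training, and clamp activations to a learned bound before quantizing them. The forward-pass fragment below is only a schematic of those ideas, outside any training loop and with an arbitrary clamp value:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_weights(w, bits):
    """Simulate quantization during training by injecting uniform noise
    with the magnitude of half a quantization step."""
    step = 2 * np.abs(w).max() / (2 ** bits - 1)
    return w + rng.uniform(-step / 2, step / 2, size=w.shape)

def clamped_relu(x, alpha):
    """Clamp activations to a bound `alpha` (learned in the real method)."""
    return np.clip(x, 0.0, alpha)

w = rng.normal(size=(16, 16))
x = rng.normal(size=16)
y = clamped_relu(noisy_weights(w, bits=3) @ x, alpha=2.0)
print(y[:4])
```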

AdaPT: Fast Emulation of Approximate DNN Accelerators in PyTorch [article]

Dimitrios Danopoulos, Georgios Zervakis, Kostas Siozios, Dimitrios Soudris, Jörg Henkel
2022 arXiv   pre-print
We evaluate the framework on several DNN models and application fields including CNNs, LSTMs, and GANs for a number of approximate multipliers with distinct bitwidth values.  ...  AdaPT can be seamlessly deployed and is compatible with most DNNs.  ...  The aim of this transformation is to allow a more efficient implementation for hardware acceleration by simply computing a matrix multiplication.  ... 
arXiv:2203.04071v1 fatcat:g64oarve7vaa3c54p3iqgmkl3u
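Emulating an approximate multiplier typically means routing every scalar multiply through a lookup table that encodes the unit's behaviour. The sketch below uses a hypothetical 4-bit "truncate the LSB" multiplier as a stand-in; AdaPT's actual multiplier models and its PyTorch integration are not reproduced here:

```python
import numpy as np

BITS = 4
# Hypothetical approximate multiplier: an exact 4-bit product table with the
# lowest result bit truncated (stand-in for a real approximate-multiplier model).
a, b = np.meshgrid(np.arange(2 ** BITS), np.arange(2 ** BITS), indexing="ij")
APPROX_LUT = (a * b) & ~1

def approx_matmul(X, W):
    """Matrix product where every scalar multiply goes through the LUT,
    the way an emulation framework would swap in an approximate unit."""
    prods = APPROX_LUT[X[:, :, None], W[None, :, :]]   # (M, K, N) LUT lookups
    return prods.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.integers(0, 2 ** BITS, size=(3, 5))
W = rng.integers(0, 2 ** BITS, size=(5, 2))
print(approx_matmul(X, W))
print(X @ W)                       # exact reference for comparison
```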

Shortening Design Time through Multiplatform Simulations with a Portable OpenCL Golden-model: The LDPC Decoder Case

G. Falcao, M. Owaida, D. Novo, M. Purnaprajna, N. Bellas, C.D. Antonopoulos, G. Karakonstantis, A. Burg, P. Ienne
2012 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines  
Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to graphics processing units (GPUs) and FPGAs.  ...  For example, designers of special-purpose chips need to explore parameters such as the optimal bitwidth and data representation.  ...  Finally, to evaluate the efficiency of the SOpenCL methodology, we used different resource scenarios of hardware availability to guide modulo scheduling of the computational and I/O streaming kernels.  ... 
doi:10.1109/fccm.2012.46 dblp:conf/fccm/FalcaoONPBAKBI12 fatcat:bh5ywb3csvdafai7sorxzkqq5a

Compiling KB-sized machine learning models to tiny IoT devices

Sridhar Gopinath, Nikhil Ghanathe, Vivek Seshadri, Rahul Sharma
2019 Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2019  
FPGA implementations generated using high-level synthesis tools.  ...  SeeDot compiles state-of-the-art KB-sized models to various microcontrollers and low-end FPGAs.  ...  The fixed-point code operates only on low-bitwidth integers and is much more efficient than emulating floating-point in software. Our compiler uses two key ideas.  ... 
doi:10.1145/3314221.3314597 dblp:conf/pldi/GopinathGSS19 fatcat:a6zf4su3rrbpfe4lyplouhx23a
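The fixed-point code the compiler emits operates only on low-bitwidth integers with an implicit power-of-two scale. The dot-product sketch below illustrates that arithmetic in isolation; the Q-format shift and integer widths are assumptions, not SeeDot's actual code-generation scheme:

```python
import numpy as np

SHIFT = 12                          # fractional bits: values scaled by 2**SHIFT

def to_fixed(x):
    """Encode floats as fixed-point integers with a 2**-SHIFT scale."""
    return np.round(x * (1 << SHIFT)).astype(np.int32)

def fixed_dot(a_fx, b_fx):
    """Dot product using only integer multiply-accumulate; the final shift
    restores the scale, as fixed-point code on a microcontroller would."""
    return int(np.sum(a_fx.astype(np.int64) * b_fx.astype(np.int64)) >> SHIFT)

a = np.array([0.5, -1.25, 2.0])
b = np.array([1.5, 0.75, -0.5])
print(fixed_dot(to_fixed(a), to_fixed(b)) / (1 << SHIFT))   # ≈ float dot product
print(float(a @ b))
```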

WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic [article]

Renkun Ni, Hong-min Chu, Oscar Castañeda, Ping-yeh Chiang, Christoph Studer, Tom Goldstein
2020 arXiv   pre-print
Low-resolution neural networks represent both weights and activations with few bits, drastically reducing the multiplication complexity.  ...  We demonstrate the efficacy of our approach on both software and hardware platforms.  ...  We propose training a network with layers that emulate integer overflows on the fixed-point preactivations z_q to maintain high accuracy.  ... 
arXiv:2007.13242v1 fatcat:dlotys7s7fhh3hnswlxa2bipey
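Emulating integer overflow on the fixed-point preactivations amounts to wrapping the exact integer result into the range of a narrow two's-complement accumulator. The sketch below (with bitwidths chosen arbitrarily) shows how often a low-resolution accumulator would wrap on random low-bit data:

```python
import numpy as np

def wrap(acc, acc_bits):
    """Emulate a two's-complement accumulator of `acc_bits` bits by wrapping
    values into [-2**(b-1), 2**(b-1)) with modular arithmetic."""
    m = 1 << acc_bits
    return ((acc + (m >> 1)) % m) - (m >> 1)

rng = np.random.default_rng(0)
w = rng.integers(-8, 8, size=(32, 64))      # 4-bit weights
x = rng.integers(0, 16, size=64)            # 4-bit activations
z = w @ x                                   # exact integer preactivations
z_wrapped = wrap(z, acc_bits=8)             # what an 8-bit accumulator would produce
print(np.mean(z != z_wrapped))              # fraction of outputs hit by overflow
```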

Power emulation

Joel Coburn, Srivaths Ravi, Anand Raghunathan
2005 Proceedings of the 42nd annual conference on Design automation - DAC '05  
In this work, we propose a new paradigm called power emulation, which exploits hardware acceleration to drastically speedup power estimation.  ...  , and the use of block memories for efficient storage within power models.  ...  Power emulation harnesses hardware prototyping platforms for power estimation, leading to orders of magnitude efficiency improvements.  ... 
doi:10.1145/1065579.1065764 dblp:conf/dac/CoburnRR05 fatcat:nckok5yqqbgglcfvqsqofh4ulq
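The "block memories for efficient storage within power models" mentioned in the snippet suggests per-module power tables indexed by observed activity. The sketch below evaluates such tables cycle by cycle in software; the module names, toggle-count indexing, and power values are hypothetical, purely to illustrate the modelling style:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-module power models: power (mW) indexed by the number of
# toggling input bits, the kind of table a block memory would hold on the FPGA.
POWER_LUT = {
    "alu":  np.linspace(0.2, 3.0, 33),      # 32-bit datapath -> 0..32 toggles
    "regs": np.linspace(0.1, 1.2, 33),
}

def cycle_power(prev_inputs, inputs):
    """Estimate one cycle's power by counting bit toggles per module and
    looking the count up in that module's power table."""
    total = 0.0
    for mod, lut in POWER_LUT.items():
        toggles = bin(prev_inputs[mod] ^ inputs[mod]).count("1")
        total += lut[toggles]
    return total

trace = [{m: int(rng.integers(0, 2 ** 32)) for m in POWER_LUT} for _ in range(5)]
for prev, cur in zip(trace, trace[1:]):
    print(f"{cycle_power(prev, cur):.2f} mW")
```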

NICE: Noise Injection and Clamping Estimation for Neural Network Quantization [article]

Chaim Baskin, Natan Liss, Yoav Chai, Evgenii Zheltonozhskii, Eli Schwartz, Raja Giryes, Avi Mendelson, Alexander M. Bronstein
2018 arXiv   pre-print
Though deep learning leads to groundbreaking performance in these domains, the networks used are very demanding computationally and are far from real-time even on a GPU, which is not power efficient and  ...  This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with as low as 3-bit weights and activations  ...  , when using low bitwidth.  ... 
arXiv:1810.00162v2 fatcat:jep74coe3bgg5jminzmhy7l6uq

Power emulation: a new paradigm for power estimation

J. Coburn, S. Ravi, A. Raghunathan
2005 Proceedings. 42nd Design Automation Conference, 2005.  
In this work, we propose a new paradigm called power emulation, which exploits hardware acceleration to drastically speedup power estimation.  ...  , and the use of block memories for efficient storage within power models.  ...  Power emulation harnesses hardware prototyping platforms for power estimation, leading to orders of magnitude efficiency improvements.  ... 
doi:10.1109/dac.2005.193902 fatcat:bxbsaukkbvcxhm5r6owyuekkaa

Multithreaded pipeline synthesis for data-parallel kernels

Mingxing Tan, Bin Liu, Steve Dai, Zhiru Zhang
2014 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)  
To ensure that the synthesized pipeline is complexity effective, we further propose efficient scheduling algorithms for minimizing the hardware overhead associated with context management.  ...  Pipelining is an important technique in high-level synthesis, which overlaps the execution of successive loop iterations or threads to achieve high throughput for loop/function kernels.  ...  This pipelined architecture often leads to high performance and low overhead for a number of reasons.  ... 
doi:10.1109/iccad.2014.7001431 dblp:conf/iccad/Tan0DZ14 fatcat:h2fv3wafcbgsnnmnnd2bxyt2je
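The throughput benefit of pipelining that the snippet alludes to follows from a simple cycle count: a pipelined loop finishes N iterations in latency + (N - 1) * II cycles instead of N * latency. A minimal worked example (the numbers are arbitrary):

```python
def loop_cycles(n_iters, latency, ii):
    """Cycle counts for a loop body of `latency` cycles: sequential execution
    versus a pipeline that starts a new iteration every `ii` cycles."""
    sequential = n_iters * latency
    pipelined = latency + (n_iters - 1) * ii
    return sequential, pipelined

seq, pipe = loop_cycles(n_iters=1000, latency=12, ii=1)
print(seq, pipe, f"speedup {seq / pipe:.1f}x")
```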
Showing results 1 — 15 out of 231 results