An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration
[article]
2019
arXiv
pre-print
Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16-bit fixed-point arithmetic. ...
Instead, it maps computations onto independent vector lanes, making use of a carefully designed vector instruction set. ...
The authors believe that an Application-Specific Instruction Set Processor (ASIP), as presented in this paper, can offer a decent tradeoff between flexibility and efficiency. ...
arXiv:1904.05106v1
fatcat:7hat33m75zgbngbr72pgg7mmye
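The snippet above mentions 16-bit fixed-point vector arithmetic for convolutional layers. A minimal sketch of what such a fixed-point convolution could look like (the Q8.8 format, the wide 32-bit accumulator, and the NumPy modeling are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def to_q8_8(x):
    # Quantize a float array to Q8.8 16-bit fixed point (assumed format).
    return np.clip(np.round(x * 256), -32768, 32767).astype(np.int16)

def conv2d_fixed_point(feature_map, kernel):
    # 2D convolution with 16-bit fixed-point operands; products are
    # accumulated in 32 bits to avoid overflow, as fixed-point ALUs do.
    h, w = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.int32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i:i + kh, j:j + kw].astype(np.int32)
            out[i, j] = np.sum(window * kernel.astype(np.int32))
    # Product of two Q8.8 values is Q16.16; shift back down to Q8.8.
    return (out >> 8).astype(np.int16)
```

In hardware, each vector lane would evaluate one such multiply-accumulate stream in parallel; the sketch only shows the arithmetic.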
ConvAix: An Application-Specific Instruction-Set Processor for the Efficient Acceleration of CNNs
2020
IEEE Open Journal of Circuits and Systems
INDEX TERMS Application-specific instruction-set processor (ASIP), convolutional neural network (CNN), very large instruction word (VLIW), quantization, low-precision computing, instruction-set architecture ...
ConvAix is an application-specific instruction-set processor (ASIP) that enables the energy-efficient processing of convolutional neural networks (CNNs) while retaining substantial flexibility through its ...
Envision is comprised of an application-specific instruction-set processor (ASIP) with a tightly integrated 2-D processing array that features the aforementioned dynamic-precision multipliers, which have zero-guarding ...
doi:10.1109/ojcas.2020.3037758
fatcat:vmfn45mzcng6dgrzry7vblf4ky
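The ConvAix entry mentions zero-guarding multipliers (in its comparison with Envision). A toy behavioral model of the zero-guarding idea, in which a multiply is simply not issued when the activation operand is zero, saving switching energy on sparse post-ReLU activations (the function and its skip counter are hypothetical, for illustration only):

```python
def zero_guarded_mac(activations, weights):
    # Multiply-accumulate that gates off multiplies with a zero activation.
    # In hardware, zero-guarding holds the multiplier inputs constant so no
    # switching power is spent; here we just skip the operation and count it.
    acc = 0
    skipped = 0
    for a, w in zip(activations, weights):
        if a == 0:          # guard: no multiply issued for this operand pair
            skipped += 1
            continue
        acc += a * w
    return acc, skipped
```

The skip counter makes the potential savings visible: for ReLU-sparse activations, a large fraction of multiplies can be gated.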
In-Memory Data Parallel Processor
2018
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '18
A compact instruction set provides generalized computation capabilities for the memory array. ...
Our results demonstrate 7.5× speedup over a multi-core CPU server for a set of applications from Parsec and 763× speedup over a server-class GPU for a set of Rodinia benchmarks. ...
Acknowledgments We thank members of M-Bits research group and the anonymous reviewers for their feedback. ...
doi:10.1145/3173162.3173171
dblp:conf/asplos/FujikiMD18
fatcat:vxzdd2jdqnbrdnlq4ypdsedkzm
A Configurable Heterogeneous Multicore Architecture With Cellular Neural Network for Real-Time Object Recognition
2009
IEEE transactions on circuits and systems for video technology (Print)
In this paper, a configurable heterogeneous multicore architecture with a dual-mode linear processor array and a cellular neural network on the network-on-chip platform is presented for real-time object ...
The cellular neural network is utilized to accelerate the visual attention algorithm for selecting salient image regions rapidly. ...
The CNN operation is characterized by a set of template parameters. ...
doi:10.1109/tcsvt.2009.2031516
fatcat:oczezwycrrdr5frnus2gfgtb3e
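In this entry, "CNN" denotes a cellular neural network, whose operation is characterized entirely by a set of template parameters. A rough sketch of one forward-Euler update step with a 3x3 feedback template A and feed-forward template B (the explicit discretization, zero padding, and step size are assumptions for illustration, not details from the paper):

```python
import numpy as np

def correlate_same(img, t):
    # 3x3 cross-correlation with zero padding, output same size as input.
    p = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += t[di, dj] * p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out

def cnn_step(x, u, A, B, z, dt=0.1):
    # One Euler step of the cellular neural network state equation:
    #   dx/dt = -x + A * y + B * u + z,  y = 0.5 * (|x + 1| - |x - 1|)
    # The templates A, B and bias z fully characterize the operation.
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))    # piecewise-linear output
    dx = -x + correlate_same(y, A) + correlate_same(u, B) + z
    return x + dt * dx
```

Changing only A, B, and z reprograms the array for a different image operation (edge detection, hole filling, etc.), which is what makes a template-parameterized accelerator attractive.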
AI Benchmark: Running Deep Neural Networks on Android Smartphones
[chapter]
2019
Lecture Notes in Computer Science
We give an overview of the hardware acceleration resources available on four main mobile chipset platforms: Qualcomm, HiSilicon, MediaTek and Samsung. ...
Additionally, we present the real-world performance results of different mobile SoCs collected with AI Benchmark, covering all main existing hardware configurations. ...
applications, and an interface for pinning target operations on a specific hardware accelerator like GPU or APU. ...
doi:10.1007/978-3-030-11021-5_19
fatcat:vxurra2fmbf2xigbbthpwzmgta
AI Benchmark: Running Deep Neural Networks on Android Smartphones
[article]
2018
arXiv
pre-print
We give an overview of the hardware acceleration resources available on four main mobile chipset platforms: Qualcomm, HiSilicon, MediaTek and Samsung. ...
Additionally, we present the real-world performance results of different mobile SoCs collected with AI Benchmark, covering all main existing hardware configurations. ...
applications, and an interface for pinning target operations on a specific hardware accelerator like GPU or APU. ...
arXiv:1810.01109v2
fatcat:ad76mlp7vjdyddzm5sesq3cmee
Extending the RISC-V ISA for Efficient RNN-based 5G Radio Resource Management
[article]
2020
arXiv
pre-print
In this paper, we investigate RNN inference acceleration by tuning both the instruction set and micro-architecture of a micro-controller-class open-source RISC-V core. ...
Programmable solutions are desirable for effective 5G-RRM to cope with the rapidly evolving landscape of RNN variations. ...
e.g., Google's TPU cores) to embedded platforms (e.g., Nvidia Jetson Xavier) to stand-alone application-specific accelerators [18] . ...
arXiv:2002.12877v2
fatcat:vhapqb2vubd6fegnk4oo5dvcle
CNNdroid: GPU-Accelerated Execution of Trained Deep Convolutional Neural Networks on Android
2016
Proceedings of the 2016 ACM Conference on Multimedia - MM '16
We present a GPU-accelerated library, dubbed CNNdroid, for execution of trained deep CNNs on Android-based mobile devices. ...
Many mobile applications running on smartphones and wearable devices would potentially benefit from the accuracy and scalability of deep CNN-based machine learning algorithms. ...
Figure 2: Example: Exynos 5433 mobile processor with ARM A53 / A57 CPU and Mali T-760 GPU (SC: Shader Core, VLIW: Very Long Instruction Word, SIMD: Single Instruction Multiple Data). ...
doi:10.1145/2964284.2973801
dblp:conf/mm/OskoueiGHG16
fatcat:kxsw7qyp6jaqrjg25wpvnizxuu
Design and Implementation of Deep Neural Network for Edge Computing
2018
IEICE transactions on information and systems
For an edge-oriented computing vector processor, combined with a specific neural network model, a new data layout method for placing the input feature maps in DDR and a rearrangement of the convolutional kernel ...
Experimental results show that the vector processor has better computing advantages than CPU and GPU, and can calculate large-scale neural network model in real time. ...
Vector processor cores support 11-issue, variable-length VLIW instructions comprising five scalar and six vector instructions; the instruction dispatch unit identifies and dispatches the package ...
doi:10.1587/transinf.2018edp7044
fatcat:2qmt4l76grebbiwwt54smmqlyq
A domain-specific supercomputer for training deep neural networks
2020
Communications of the ACM
With 2D vector registers and compute units in TPUv2/v3, the layout of data in both compute units and memory is critical to performance, perhaps more than for a vector or SIMD processor. ...
The 322-bit VLIW instruction can launch eight operations: two scalar, two vector ALU, vector load and store, and a pair of slots that queue data to and from the matrix available between two halves of a ...
Moreover, TPU supercomputers with 256-1,024 chips running a production application have 5x-10x the performance/Watt of the #1 traditional supercomputer on the Green500 list running Linpack and 24x-44x of ...
doi:10.1145/3360307
fatcat:xomnv3wdebdxphccmfhccapcwa
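The snippet above describes a 322-bit VLIW instruction that launches eight operations per bundle: two scalar, two vector ALU, a vector load and store, and a pair of matrix-queue slots. A toy model of such bundle packing, where the compiler fills every slot and unused slots become explicit no-ops (the slot names are invented for illustration; the paper specifies only the operation classes, not mnemonics):

```python
# Eight operation slots per bundle, following the classes named in the paper.
SLOTS = ["scalar0", "scalar1", "valu0", "valu1",
         "vload", "vstore", "mxu_push", "mxu_pop"]

def make_bundle(**ops):
    # Build one VLIW bundle: every slot is present, unused slots get an
    # explicit 'nop'. Packing independent ops into slots is the compiler's
    # job; the hardware issues the whole bundle at once.
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"no such slot: {unknown}")
    return {slot: ops.get(slot, "nop") for slot in SLOTS}
```

This makes the key VLIW property visible: instruction-level parallelism is decided statically, so a bundle with mostly nops wastes issue slots, which is why scheduling quality dominates utilization.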
Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
[article]
2016
arXiv
pre-print
Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks: the many convolution and fully-connected layers demand a large amount of communication for parallel ...
This makes FPGAs potentially powerful solutions for real-time classification of CNNs. ...
From single-core CPUs and DSPs (digital signal processors), well suited to single-instruction-multiple-data (SIMD) vector architectures, the computing market changed to multicore chips in the early year ...
arXiv:1609.09296v1
fatcat:fgowcrakozdmxoaq4eoutnwlvy
Advancements in Microprocessor Architecture for Ubiquitous AI—An Overview on History, Evolution, and Upcoming Challenges in AI Implementation
2021
Micromachines
Recently, application-specific instruction-set architecture for AI applications has also been supported in different microprocessors. ...
In tandem with the emergence of multicore processors, ML techniques started to be embedded in a range of scenarios and applications. ...
Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/mi12060665
fatcat:edbpii37wfgnxamx76k42qldsq
Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision
[article]
2018
arXiv
pre-print
Specifically, we propose to expose the motion data that is naturally generated by the Image Signal Processor (ISP) early in the vision pipeline to the CNN engine. ...
We first propose an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by continuous vision applications. ...
For instance, Movidius Myriad 2 [41] is a VLIW-based vision processor used in Google Clip camera [17] and DJI Phantom 4 drone [33] . Clemons et al. ...
arXiv:1803.11232v1
fatcat:kil3eeq3vfa3tgimbc4h4zi5di
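Euphrates' key idea is to extrapolate vision results from the ISP's motion data instead of running a costly CNN inference on every frame. A simplified sketch of that extrapolation for a single bounding box, translating the previous box by the mean motion vector inside it (the motion-vector format `(x, y, dx, dy)` and the averaging policy are assumptions for illustration, not the paper's algorithm):

```python
def extrapolate_box(box, motion_vectors):
    # Cheaply update a detection box using ISP motion vectors rather than a
    # full CNN inference: shift the previous box by the average motion of the
    # vectors whose anchor point lies inside it.
    x0, y0, x1, y1 = box
    inside = [(dx, dy) for (mx, my, dx, dy) in motion_vectors
              if x0 <= mx <= x1 and y0 <= my <= y1]
    if not inside:          # no motion info for this region: keep the old box
        return box
    mdx = sum(v[0] for v in inside) / len(inside)
    mdy = sum(v[1] for v in inside) / len(inside)
    return (x0 + mdx, y0 + mdy, x1 + mdx, y1 + mdy)
```

A real system would re-run the CNN periodically to correct accumulated drift; frames in between can use only this cheap update.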
PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision
2015
Journal of Signal Processing Systems
We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency up ...
To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy ...
Optical flow benchmark As a representative application for the usage of PULP as an accelerator for an autonomous nano-UAV, we developed an optical flow benchmark that is meant to be integrated in the drone ...
doi:10.1007/s11265-015-1070-9
fatcat:o27pg4kiq5cbvchjhunk7gqx3y
Evaluation of Optimized CNNs on Heterogeneous Accelerators using a Novel Benchmarking Approach
2020
IEEE transactions on computers
We benchmark across a spectrum of FPGA, GPU, TPU and VLIW processors for systematically pruned and quantized neural networks (ResNet50, GoogLeNetv1, MobileNetv1, a VGG derivative, a multilayer perceptron) over many deployment options, considering power, latency, and throughput at a specific accuracy. ...
In order to help drive clarity in this space, we systematically evaluate and compare pruned and quantized variants of the same CNNs on numerous FPGA accelerators, GPU, CPU, TPU and VLIW processors across ...
doi:10.1109/tc.2020.3022318
fatcat:qbzrs75ierayvovbibwm2ezcge
Showing results 1 — 15 out of 69 results