
An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration [article]

Andreas Bytyn, Rainer Leupers, Gerd Ascheid
2019 arXiv   pre-print
Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16-bit fixed-point arithmetic.  ...  Instead, it maps computations onto independent vector lanes, making use of a carefully designed vector instruction set.  ...  The authors believe that an Application-Specific Instruction Set Processor (ASIP), as presented in this paper, can offer a decent tradeoff between flexibility and efficiency.  ... 
arXiv:1904.05106v1 fatcat:7hat33m75zgbngbr72pgg7mmye
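The 16-bit fixed-point arithmetic mentioned in this abstract can be illustrated with a minimal sketch. The Q8.8 format, the saturation bounds, and the wide accumulator are illustrative assumptions, not ConvAix's actual configuration:

```python
# Illustrative Q8.8 fixed-point multiply-accumulate, the kind of
# arithmetic used by 16-bit fixed-point CNN accelerators.
# Q-format and saturation choices are assumptions for illustration.

FRAC_BITS = 8  # Q8.8: 8 integer bits, 8 fractional bits

def to_fixed(x: float) -> int:
    """Quantize a float to a 16-bit signed fixed-point value."""
    v = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << 15), min((1 << 15) - 1, v))  # saturate to int16

def to_float(v: int) -> float:
    """Convert a Q8.8 value back to float."""
    return v / (1 << FRAC_BITS)

def fixed_dot(weights: list, activations: list) -> float:
    """Dot product in fixed point with a wide accumulator, then rescale."""
    acc = 0  # accumulator is wider than 16 bits, as in most MAC arrays
    for w, a in zip(weights, activations):
        acc += to_fixed(w) * to_fixed(a)  # 16x16 -> 32-bit product
    return acc / (1 << (2 * FRAC_BITS))  # products carry 2*FRAC_BITS

print(fixed_dot([0.5, -0.25], [1.0, 2.0]))  # prints 0.0
```

Keeping the accumulator wider than the operand width is what lets such designs sum many products before a single rounding step.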

ConvAix: An Application-Specific Instruction-Set Processor for the Efficient Acceleration of CNNs

Andreas Bytyn, Rainer Leupers, Gerd Ascheid
2020 IEEE Open Journal of Circuits and Systems  
INDEX TERMS Application-specific instruction-set processor (ASIP), convolutional neural network (CNN), very long instruction word (VLIW), quantization, low-precision computing, instruction-set architecture  ...  ConvAix is an application-specific instruction-set processor (ASIP) that enables the energy-efficient processing of convolutional neural networks (CNNs) while retaining substantial flexibility through its  ...  Envision is comprised of an application-specific instruction-set processor (ASIP) with a tightly integrated 2-D processing array that features the aforementioned dynamic-precision multipliers, which have zero-guarding  ... 
doi:10.1109/ojcas.2020.3037758 fatcat:vmfn45mzcng6dgrzry7vblf4ky

In-Memory Data Parallel Processor

Daichi Fujiki, Scott Mahlke, Reetuparna Das
2018 Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '18  
A compact instruction set provides generalized computation capabilities for the memory array.  ...  Our results demonstrate 7.5× speedup over a multi-core CPU server for a set of applications from Parsec and 763× speedup over a server-class GPU for a set of Rodinia benchmarks.  ...  Acknowledgments We thank members of M-Bits research group and the anonymous reviewers for their feedback.  ... 
doi:10.1145/3173162.3173171 dblp:conf/asplos/FujikiMD18 fatcat:vxzdd2jdqnbrdnlq4ypdsedkzm

A Configurable Heterogeneous Multicore Architecture With Cellular Neural Network for Real-Time Object Recognition

Kwanho Kim, Seungjin Lee, Joo-Young Kim, Minsu Kim, Hoi-Jun Yoo
2009 IEEE transactions on circuits and systems for video technology (Print)  
In this paper, a configurable heterogeneous multicore architecture with a dual-mode linear processor array and a cellular neural network on the network-on-chip platform is presented for real-time object  ...  The cellular neural network is utilized to accelerate the visual attention algorithm for selecting salient image regions rapidly.  ...  The CNN operation is characterized by a set of template parameters.  ... 
doi:10.1109/tcsvt.2009.2031516 fatcat:oczezwycrrdr5frnus2gfgtb3e

AI Benchmark: Running Deep Neural Networks on Android Smartphones [chapter]

Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, Luc Van Gool
2019 Lecture Notes in Computer Science  
We give an overview of the hardware acceleration resources available on four main mobile chipset platforms: Qualcomm, HiSilicon, MediaTek and Samsung.  ...  Additionally, we present the real-world performance results of different mobile SoCs collected with AI Benchmark, covering all main existing hardware configurations.  ...  applications, and an interface for pinning target operations on a specific hardware accelerator like the GPU or APU.  ... 
doi:10.1007/978-3-030-11021-5_19 fatcat:vxurra2fmbf2xigbbthpwzmgta

AI Benchmark: Running Deep Neural Networks on Android Smartphones [article]

Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, Luc Van Gool
2018 arXiv   pre-print
We give an overview of the hardware acceleration resources available on four main mobile chipset platforms: Qualcomm, HiSilicon, MediaTek and Samsung.  ...  Additionally, we present the real-world performance results of different mobile SoCs collected with AI Benchmark, covering all main existing hardware configurations.  ...  applications, and an interface for pinning target operations on a specific hardware accelerator like the GPU or APU.  ... 
arXiv:1810.01109v2 fatcat:ad76mlp7vjdyddzm5sesq3cmee

Extending the RISC-V ISA for Efficient RNN-based 5G Radio Resource Management [article]

Renzo Andri, Tomas Henriksson, Luca Benini
2020 arXiv   pre-print
In this paper, we investigate RNN inference acceleration by tuning both the instruction set and micro-architecture of a micro-controller-class open-source RISC-V core.  ...  Programmable solutions are desirable for effective 5G-RRM to cope with the rapidly evolving landscape of RNN variations.  ...  e.g., Google's TPU cores) to embedded platforms (e.g., Nvidia Jetson Xavier) to stand-alone application-specific accelerators [18].  ... 
arXiv:2002.12877v2 fatcat:vhapqb2vubd6fegnk4oo5dvcle
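The kind of ISA extension explored here can be sketched as a fused operation that collapses several base-ISA instructions into one. The instruction's name, semantics, and int8-style saturation bounds below are illustrative assumptions, not the paper's actual RISC-V extension:

```python
# Toy model of a fused RNN-oriented custom instruction: multiply-
# accumulate followed by a saturating clamp (hardtanh-style activation
# in a quantized integer domain). Name and semantics are assumptions
# for illustration only.

def mac_hardtanh(acc: int, w: int, x: int,
                 lo: int = -128, hi: int = 127) -> int:
    """Fused op: acc + w*x, saturated to [lo, hi]."""
    return max(lo, min(hi, acc + w * x))

# Emulating one recurrent-gate update with the fused instruction:
acc = 0
for w, x in [(3, 10), (2, 40), (-1, 5)]:
    acc = mac_hardtanh(acc, w, x)
print(acc)  # prints 105
```

In hardware, fusing the multiply, add, and clamp removes instruction-fetch and register-writeback overhead on every inner-loop iteration, which is where micro-controller-class cores lose most of their cycles.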

CNNdroid

Seyyed Salar Latifi Oskouei, Hossein Golestani, Matin Hashemi, Soheil Ghiasi
2016 Proceedings of the 2016 ACM on Multimedia Conference - MM '16  
We present a GPU-accelerated library, dubbed CNNdroid, for execution of trained deep CNNs on Android-based mobile devices.  ...  Many mobile applications running on smartphones and wearable devices would potentially benefit from the accuracy and scalability of deep CNN-based machine learning algorithms.  ...  Figure 2: Example: Exynos 5433 mobile processor with ARM A53 / A57 CPU and Mali T-760 GPU (SC: Shader Core, VLIW: Very Long Instruction Word, SIMD: Single Instruction Multiple Data).  ... 
doi:10.1145/2964284.2973801 dblp:conf/mm/OskoueiGHG16 fatcat:kxsw7qyp6jaqrjg25wpvnizxuu

Design and Implementation of Deep Neural Network for Edge Computing

Junyang ZHANG, Yang GUO, Xiao HU, Rongzhen LI
2018 IEICE transactions on information and systems  
For an edge-oriented computing vector processor, combined with a specific neural network model, a new data layout method for putting the input feature maps in DDR, rearrangement of the convolutional kernel  ...  Experimental results show that the vector processor has better computing advantages than CPU and GPU, and can calculate a large-scale neural network model in real time.  ...  Vector processor cores support 11-issue, variable-length VLIW instructions, comprising five scalar and six vector instructions, with an instruction dispatch unit to identify and dispatch the packet  ... 
doi:10.1587/transinf.2018edp7044 fatcat:2qmt4l76grebbiwwt54smmqlyq
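One common form of the kernel rearrangement hinted at in this abstract is regrouping convolution weights so that consecutive output channels land in consecutive vector lanes. The lane count and the nested-list layout below are illustrative assumptions, not the paper's actual DDR layout:

```python
# Sketch of lane-oriented kernel rearrangement for a vector processor:
# group output-channel kernels into lane-sized blocks so one vector
# load feeds all lanes. LANES and the layout are assumptions.

LANES = 4  # assumed vector width

def rearrange_kernels(weights):
    """Regroup weights: [oc][ic][k] -> [oc_block][ic][k][lane]."""
    blocks = []
    for base in range(0, len(weights), LANES):
        group = weights[base:base + LANES]  # up to LANES output channels
        ic_count = len(group[0])
        k_count = len(group[0][0])
        block = [[[group[lane][ic][k] for lane in range(len(group))]
                  for k in range(k_count)]
                 for ic in range(ic_count)]
        blocks.append(block)
    return blocks

# Four 1-input-channel, 2-tap kernels; lane dimension becomes innermost:
print(rearrange_kernels([[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]])[0][0][0])
# prints [1, 3, 5, 7]
```

Making the lane dimension innermost is what turns a per-channel scalar load into a single contiguous vector load from DDR.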

A domain-specific supercomputer for training deep neural networks

Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, David Patterson
2020 Communications of the ACM  
With 2D vector registers and compute units in TPUv2/v3, the layout of data in both compute units and memory is critical to performance, perhaps more than for a vector or SIMD processor.  ...  The 322-bit VLIW instruction can launch eight operations: two scalar, two vector ALU, vector load and store, and a pair of slots that queue data to and from the matrix available between two halves of a  ...  Moreover, TPU supercomputers with 256-1,024 chips running a production application have 5x-10x performance/Watt of the #1 traditional supercomputer on the Green500 list running Linpack and 24x-44x of  ... 
doi:10.1145/3360307 fatcat:xomnv3wdebdxphccmfhccapcwa
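The eight-slot VLIW bundle this abstract describes can be modeled schematically. The slot names and the dict encoding below are illustrative assumptions; the real 322-bit encoding is not public in this form:

```python
# Schematic model of an eight-slot VLIW bundle: two scalar ops, two
# vector ALU ops, a vector load, a vector store, and two slots that
# queue data to/from the matrix unit. Slot names are assumptions.

SLOTS = ("scalar0", "scalar1", "valu0", "valu1",
         "vload", "vstore", "mxu_push", "mxu_pop")

def make_bundle(**ops):
    """Build a bundle, filling unused slots with 'nop'."""
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slots: {unknown}")
    return {slot: ops.get(slot, "nop") for slot in SLOTS}

bundle = make_bundle(scalar0="addi s1, s1, 1",
                     valu0="vadd v0, v1, v2",
                     vload="vld v3, [a0]")
print(sum(op != "nop" for op in bundle.values()))  # prints 3
```

The compiler's job on such a machine is exactly this packing problem: filling as many of the eight slots per cycle as the dependence graph allows.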

Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs [article]

R. Tapiador, A. Rios-Navarro, A. Linares-Barranco, Minkyu Kim, Deepak Kadetotad, Jae-sun Seo
2016 arXiv   pre-print
Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding a large amount of communication for parallel  ...  This makes FPGAs potentially powerful solutions for real-time classification of CNNs.  ...  From single-core CPUs and DSPs (Digital Signal Processors), well oriented to single-instruction-multiple-data (SIMD) vectored architectures, the computing market changed to multicore chips in the early  ... 
arXiv:1609.09296v1 fatcat:fgowcrakozdmxoaq4eoutnwlvy

Advancements in Microprocessor Architecture for Ubiquitous AI—An Overview on History, Evolution, and Upcoming Challenges in AI Implementation

Fatima Hameed Khan, Muhammad Adeel Pasha, Shahid Masud
2021 Micromachines  
Recently, application-specific instruction-set architecture for AI applications has also been supported in different microprocessors.  ...  In tandem with the emergence of multicore processors, ML techniques started to be embedded in a range of scenarios and applications.  ... 
doi:10.3390/mi12060665 fatcat:edbpii37wfgnxamx76k42qldsq

Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision [article]

Yuhao Zhu, Anand Samajdar, Matthew Mattina, Paul Whatmough
2018 arXiv   pre-print
Specifically, we propose to expose the motion data that is naturally generated by the Image Signal Processor (ISP) early in the vision pipeline to the CNN engine.  ...  We first propose an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by continuous vision applications.  ...  For instance, Movidius Myriad 2 [41] is a VLIW-based vision processor used in Google Clip camera [17] and DJI Phantom 4 drone [33] . Clemons et al.  ... 
arXiv:1803.11232v1 fatcat:kil3eeq3vfa3tgimbc4h4zi5di
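The motion-extrapolation idea in this abstract — running the expensive CNN only occasionally and shifting the previous result by ISP motion data in between — can be sketched as follows. The function names, the (x, y, w, h) box format, and the mean-vector heuristic are illustrative assumptions, not Euphrates' actual algorithm:

```python
# Sketch of ISP-motion-based extrapolation between CNN inferences.
# Box format and the fixed cnn_interval heuristic are assumptions.

def extrapolate_box(box, motion_vectors):
    """Shift a bounding box by the mean motion vector of its region."""
    x, y, w, h = box
    if not motion_vectors:
        return box
    dx = sum(mv[0] for mv in motion_vectors) / len(motion_vectors)
    dy = sum(mv[1] for mv in motion_vectors) / len(motion_vectors)
    return (x + dx, y + dy, w, h)

def track(frames_of_mvs, initial_box, cnn_interval=4, run_cnn=None):
    """Run the expensive CNN every cnn_interval frames; in between,
    extrapolate from the ISP's motion vectors (nearly free)."""
    box, boxes = initial_box, []
    for i, mvs in enumerate(frames_of_mvs):
        if run_cnn is not None and i % cnn_interval == 0:
            box = run_cnn(i)                  # expensive inference
        else:
            box = extrapolate_box(box, mvs)   # cheap extrapolation
        boxes.append(box)
    return boxes
```

The energy saving comes from amortization: if extrapolation is accurate enough for k-1 of every k frames, CNN energy drops by roughly that factor.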

PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision

Francesco Conti, Davide Rossi, Antonio Pullini, Igor Loi, Luca Benini
2015 Journal of Signal Processing Systems  
We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency up  ...  To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy  ...  Optical flow benchmark As a representative application for the usage of PULP as an accelerator for an autonomous nano-UAV, we developed an optical flow benchmark that is meant to be integrated in the drone  ... 
doi:10.1007/s11265-015-1070-9 fatcat:o27pg4kiq5cbvchjhunk7gqx3y

Evaluation of Optimized CNNs on Heterogeneous Accelerators using a Novel Benchmarking Approach

Michaela Blott, Nicholas Fraser, Giulio Gambardella, Lisa Halder, Johannes Kath, Zachary Neveu, Yaman Umuroglu, Alina Vasilciuc, Miriam Leeser, Linda Doyle
2020 IEEE transactions on computers  
We benchmark across a spectrum of FPGA, GPU, TPU and VLIW processors for systematically pruned and quantized neural networks (ResNet50, GoogLeNetv1, MobileNetv1, a VGG derivative, a multilayer perceptron  ...  ) over many deployment options, considering power, latency, and throughput at a specific accuracy.  ...  In order to help drive clarity in this space, we systematically evaluate and compare pruned and quantized variants of the same CNNs on numerous FPGA accelerators, GPU, CPU, TPU and VLIW processors across  ... 
doi:10.1109/tc.2020.3022318 fatcat:qbzrs75ierayvovbibwm2ezcge
Showing results 1–15 out of 69