
ILP-M Conv: Optimize Convolution Algorithm for Single-Image Convolution Neural Network Inference on Mobile GPUs [article]

Zhuoran Ji
2019 arXiv   pre-print
However, GPU convolution algorithms are designed for mini-batch neural network training; single-image convolution neural network inference on mobile GPUs is not well-studied.  ...  The HNTMP convolution algorithm achieves a 14.6× speedup over the most popular im2col convolution algorithm, and a 2.30× speedup over the fastest existing convolution algorithm (direct convolution) as far  ...  The computation of batched many-channel convolution is data-intensive and massively parallel, which is naturally executed on Single-Instruction-Multiple-Data (SIMD) processors.  ... 
arXiv:1909.02765v2 fatcat:tfof36jde5dk7axe7t7zneidby
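The im2col algorithm this abstract benchmarks against is the standard way to lower convolution onto a matrix multiply. A minimal NumPy sketch for the single-image case (stride 1, no padding; both are simplifying assumptions, not the paper's implementation):

```python
import numpy as np

def im2col_conv(x, w):
    """Single-image convolution via im2col: unfold every receptive-field
    patch into a column, then reduce the convolution to one GEMM.
    x: (C, H, W) input; w: (K, C, R, S) filters; stride 1, no padding."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    # Gather the patches into a (C*R*S, OH*OW) matrix.
    cols = np.empty((C * R * S, OH * OW))
    idx = 0
    for c in range(C):
        for r in range(R):
            for s in range(S):
                cols[idx] = x[c, r:r + OH, s:s + OW].reshape(-1)
                idx += 1
    # One matrix multiply: (K, C*R*S) @ (C*R*S, OH*OW) -> (K, OH*OW)
    return (w.reshape(K, -1) @ cols).reshape(K, OH, OW)
```

The appeal on SIMD hardware is that all the irregularity moves into the data-layout step, leaving a single dense GEMM; the cost is the (R·S)-fold duplication of the input, which is exactly the overhead single-image mobile inference papers like this one try to avoid.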

Accelerating Deep Neural Networks implementation: A survey

Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud
2021 IET Computers & Digital Techniques  
Deploying such Deep Neural Networks (DNNs) on embedded devices is still a challenging task considering their massive computation and storage requirements.  ...  Then, a detailed description of different optimization techniques used in recent research works is explored.  ...  Additionally, the study by Di Cecco et al. [122] implemented a Winograd convolution engine on FPGA which achieved 55 GOP/s when executing VGG.  ... 
doi:10.1049/cdt2.12016 fatcat:3kl4j5ztl5eahmgv7vetu2egay

WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs [article]

Xinheng Liu, Yao Chen, Cong Hao, Ashutosh Dhar, Deming Chen
2021 arXiv   pre-print
In this work, we are the first to propose an optimized Winograd processing element (WinoPE), which can naturally support multiple convolution kernel sizes with the same amount of computing resources and  ...  However, handling arbitrary convolution kernel sizes in FPGA-based Winograd processing elements and supporting efficient data access remain underexplored.  ...  The non-convolution layers are executed in the processors with multi-thread optimization for end-to-end model execution.  ... 
arXiv:2107.04244v1 fatcat:ktwzog53yvbfjhywvysguqqcca

Accelerating Neural Network Inference on FPGA-Based Platforms—A Survey

Ran Wu, Xinmin Guo, Jian Du, Junbao Li
2021 Electronics  
Based on the analysis, we generalize the acceleration strategies into five aspects: computing complexity, computing parallelism, data reuse, pruning, and quantization.  ...  Then previous works on neural network acceleration are introduced following these topics. We summarize how to design a technical route for practical applications based on these strategies.  ...  GPUs afford massive, parallel, and pipelined computation, which is important in DNN inference.  ... 
doi:10.3390/electronics10091025 doaj:92e7eb4228a44c6387f846a1203529d0 fatcat:2xa7dv5hsjbczpvc4w6acdehwu

Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels [article]

John Lawson, Mehdi Goli, Duncan McBain, Daniel Soutar, Louis Sugy
2019 arXiv   pre-print
These hardware platforms have different performance characteristics and optimization requirements.  ...  Furthermore, tuning for new devices amounts to choosing the combinations of kernel parameters that perform best on the hardware.  ...  A naive parallelization approach on massively parallel architectures is to assign one value of the output C per thread, accumulating the dot product of the i-th row of op_A(A) with the j-th column of  ... 
arXiv:1904.05347v1 fatcat:o3pyytvegjfvpaz4tvwvsebyyu
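The naive one-output-per-thread GEMM mapping described in the snippet can be sketched in plain Python, with each (i, j) loop iteration standing in for one GPU work-item; op_A and op_B are taken as the identity here, which is an assumption for brevity:

```python
import numpy as np

def naive_gemm(A, B):
    """Naive massively parallel GEMM mapping: each (i, j) pair (one
    thread in the real kernel) independently accumulates the dot
    product of row i of A with column j of B into C[i, j]."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(M):        # each (i, j) iteration is one thread's work
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C
```

This mapping is trivially parallel but memory-bound: every thread streams a full row of A and column of B from global memory, which is why the tiled, parameterized kernels the paper tunes exist in the first place.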

Analyzing Machine Learning Workloads Using a Detailed GPU Simulator [article]

Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor Aamodt
2019 arXiv   pre-print
tool we observe that cuDNN API calls contain many varying phases and appear to include potentially inefficient microarchitecture behaviour such as DRAM partition bank camping, at least when executed on  ...  Training DNNs requires massive amounts of computational power, which is currently predominantly done with graphics processor units (GPUs).  ...  However, this has a negligible impact on the IPC, since forward convolution with Winograd Nonfused is actually one of the fastest algorithms.  ... 
arXiv:1811.08933v2 fatcat:25pubfhlqndxzkwxniylse45wm

FPGA based technical solutions for high throughput data processing and encryption for 5G communication: A review

P. Visconti, R. Velazquez, Carolina Del-Valle Soto, R. de Fazio
2021 TELKOMNIKA (Telecommunication Computing Electronics and Control)  
The field programmable gate array (FPGA) devices are ideal solutions for high-speed processing applications, given their flexibility, parallel processing capability, and power efficiency.  ...  algorithms by employing the Xilinx Zynq UltraScale+ MPSoC ZCU102 FPGA platform are discussed, and then we introduce our high-speed and lightweight implementation of the well-known AES-128 algorithm, developed on  ...  faster than the most heavily optimized software implementation on the Intel i5 processor.  ... 
doi:10.12928/telkomnika.v19i4.18400 fatcat:r6zybornqjal7p63o3ct7utkyi


Chao-Tsung Huang, Yu-Chun Ding, Huan-Ching Wang, Chi-Wen Weng, Kai-Ping Lin, Li-Wei Wang, Li-De Chen
2019 Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture - MICRO '52  
Then we devise a coarse-grained instruction set architecture, FBISA, to support power-hungry convolution through massive parallelism.  ...  In this paper, we approach this goal by considering the inference flow, network model, instruction set, and processor design jointly to optimize hardware performance and image quality.  ...  Winograd convolution is an efficient algorithm that reduces the number of multiplications for CONV3×3 and has recently shown advantages on GPUs [34], FPGAs [38], and embedded processors [58].  ... 
doi:10.1145/3352460.3358263 dblp:conf/micro/HuangDWWLWC19 fatcat:u3n4eq42orazrpehal6swwxu4y
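The multiplier reduction that makes Winograd convolution attractive for CONV3×3 hardware can be illustrated with the 1-D F(2,3) tile: two outputs of a 3-tap filter cost 4 multiplications instead of the naive 6. This is the textbook transform, not this paper's FBISA implementation:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap FIR over a 4-sample tile
    using 4 multiplications instead of the naive 6.
    d: 4 input samples; g: 3 filter taps."""
    # Element-wise products of the transformed input and filter tiles.
    m = np.array([
        (d[0] - d[2]) * g[0],
        (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2,
        (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2,
        (d[1] - d[3]) * g[2],
    ])
    # Output transform A^T m recovers the two convolution outputs.
    return np.array([m[0] + m[1] + m[2], m[1] - m[2] - m[3]])
```

The 2-D F(2×2, 3×3) case used for CONV3×3 nests this transform in both dimensions, replacing 36 multiplications per output tile with 16; in hardware the filter-side transforms are precomputed, so only adders surround the multiplier array.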

Integrated photonic FFT for photonic tensor operations towards efficient and high-speed neural networks

Moustafa Ahmed, Yas Al-Hadeethi, Ahmed Bakry, Hamed Dalir, Volker J. Sorger
2020 Nanophotonics  
A sensitivity analysis shows that this optical processor must be thermally phase-stabilized to within a few degrees.  ...  The algorithmic execution time is determined by the time-of-flight of the signal through this photonic reconfigurable passive FFT 'filter' circuit and is on the order of tens of picoseconds.  ...  Harnessing the strengths of optics for emerging processors bears much potential to free circuits from charging wires, while utilizing massive parallelism paradigms [8].  ... 
doi:10.1515/nanoph-2020-0055 fatcat:2qqgbmvnd5de3lohohmyfwyz54

Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors [article]

Lukas Baischer, Matthias Wess, Nima TaheriNejad
2021 arXiv   pre-print
FPGA-based implementations are well-suited to show the effect of DNN optimization methods on accuracy and throughput. For this reason, the focus of this work is more on FPGA-based implementations.  ...  In particular, we focus on accelerating the inference of convolutional neural networks (CNNs) used for image recognition tasks, given that many different hardware architectures exist.  ...  Platforms like NVIDIA Jetson combine an ARM CPU with a small power-optimized GPU for parallel image processing [23]. Table 1 presents an overview of existing NVIDIA GPUs.  ... 
arXiv:2104.09252v1 fatcat:625wtuskhff3lbswhwmj7decni

A Survey on the Optimization of Neural Network Accelerators for Micro-AI On-Device Inference

Arnab Neelim Mazumder, Jian Meng, Hasib-Al Rashid, Utteja Kallakuri, Xin Zhang, Jae-sun Seo, Tinoosh Mohsenin
2021 IEEE Journal on Emerging and Selected Topics in Circuits and Systems  
the current hardware approaches towards efficient deployment of the micro-AI models on hardware.  ...  Hence, it is becoming increasingly important to scale these DNNs so that they can fit on resource-constrained hardware and edge devices.  ...  Additionally, since MAC operations on an FPGA can be parallelized, parallel compute paradigms including temporal and spatial architectures are explored for highly parallel solutions.  ... 
doi:10.1109/jetcas.2021.3129415 fatcat:nknpy4eernaeljz2hpqafe7sja

Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems [article]

Miguel de Prado, Nuria Pazos, Luca Benini
2018 arXiv   pre-print
We show that an optimized combination can achieve a 45× speedup in inference latency on CPU compared to a dependency-free baseline, and 2× on average on GPGPU compared to the best vendor library.  ...  finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices.  ...  The massive computation that CNNs demand prompts several optimization approaches for inference on embedded devices.  ... 
arXiv:1811.07315v1 fatcat:iyyg7uupi5huzeseb22nb7ojcy

High-level design using Intel FPGA OpenCL: A hyperspectral imaging spatial-spectral classifier

R. Domingo, R. Salvador, H. Fabelo, D. Madronal, S. Ortega, R. Lazcano, E. Juarez, G. Callico, C. Sanz
2017 2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)  
From a common baseline C implementation running on the embedded ARM® Cortex®-A9, OpenCL-based synthesis is evaluated applying different generic and vendor-specific optimizations.  ...  , platform-driven optimization is needed.  ...  On-chip memory is used for vectorization and the Winograd transform [8] to further boost performance by reducing the number of required multiply-accumulate operations in convolutions.  ... 
doi:10.1109/recosoc.2017.8016152 dblp:conf/recosoc/DomingoSFMOLJCS17 fatcat:3chojidwffa5lp5l2nf6myyhp4

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [article]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
2018 arXiv   pre-print
Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs.  ...  It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.  ...  We would also like to thank members of Sampa, SAMPL and Systems groups at the Allen School for their feedback on the work and manuscript.  ... 
arXiv:1802.04799v3 fatcat:e6htzyqaqjhpnm3yyi6xl3mdoq

Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks

Jack Turner, Jose Cano, Valentin Radu, Elliot J. Crowley, Michael O'Boyle, Amos Storkey
2018 2018 IEEE International Symposium on Workload Characterization (IISWC)  
Convolutional Neural Networks (CNNs) are extremely computationally demanding, presenting a large barrier to their deployment on resource-constrained devices.  ...  Inference Stack and take an across-stack approach by implementing and evaluating the most common neural network compression techniques (weight pruning, channel pruning, and quantisation) and optimising their parallel  ...  Since OpenMP does not support ARM Mali GPUs, the networks are parallelised only on the CPU of the board using up to 8 threads (cores) of the Cortex-A processor. We developed two OpenCL versions.  ... 
doi:10.1109/iiswc.2018.8573503 dblp:conf/iiswc/TurnerCRCOS18 fatcat:hxxhuovm6fhyhheg55vtwyvsoi
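Of the compression techniques this paper evaluates (weight pruning, channel pruning, quantisation), post-training quantisation is the simplest to sketch. Below is a symmetric per-tensor int8 scheme; the exact quantisation scheme is an assumption for illustration, not necessarily the one used in the paper:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantisation: map float weights onto
    [-127, 127] with a single scale factor (assumes w is not all-zero)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for accuracy comparison."""
    return q.astype(np.float32) * scale
```

Because the scale is per-tensor, a single outlier weight inflates the step size for the whole tensor; per-channel scales, as used in most deployed schemes, mitigate exactly that.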
Showing results 1 — 15 out of 47 results