A Heterogeneous RISC-V Processor for Efficient DNN Application in Smart Sensing System
2021
Sensors
Accelerating the compute-intensive DNN inference is, therefore, of utmost importance. ...
Owing to the physical limitations of sensing devices, the processor design must meet balanced performance metrics, including low power consumption, low latency, and flexible configuration. ...
Acknowledgments: We thank Hui Qiang, Xin Li and Jiaying Yang for their assistance in providing the experimental data.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/s21196491
pmid:34640811
fatcat:mbr2d5mggrhhje24dvgdzo4yqu
HyGCN: A GCN Accelerator with Hybrid Architecture
[article]
2020
arXiv
pre-print
Third, we optimize the overall system via inter-engine pipeline for inter-phase fusion and priority-based off-chip memory access coordination to improve off-chip bandwidth utilization. ...
Compared to the state-of-the-art software frameworks running on an Intel Xeon CPU and an NVIDIA V100 GPU, our work achieves on average a 1509× speedup with 2500× energy reduction and an average 6.5× speedup with 10 ...
Acknowledgments We thank the anonymous reviewers of HPCA 2020 and the SEALers in the Scalable Energy-efficient Architecture Lab (SEAL) for their constructive and insightful comments. ...
arXiv:2001.02514v1
fatcat:uts223fpivefhh4lmrcyg7asuy
A Survey on Graph Processing Accelerators: Challenges and Opportunities
[article]
2019
arXiv
pre-print
Despite a wealth of existing efforts to develop graph processing systems that improve performance and/or energy efficiency on traditional architectures, dedicated hardware solutions, also referred ...
Specifically, we review the relevant techniques in three core components toward a graph processing accelerator: preprocessing, parallel graph computation and runtime scheduling. ...
To support large-scale graphs, hybrid CPU-GPU systems [64, 65], multi-GPU systems [19, 66], and out-of-memory systems [67, 68] have been proposed. ...
arXiv:1902.10130v1
fatcat:p5lzlf3gubckfpu4eowgo4myi4
Communication-Efficient Edge AI: Algorithms and Systems
[article]
2020
arXiv
pre-print
By pushing inference and training processes of AI models to edge nodes, edge AI has emerged as a promising alternative. ...
We then introduce communication-efficient techniques, from both algorithmic and system perspectives for training and inference tasks at the network edge. ...
Zhi Ding from the University of California at Davis for insightful and constructive comments to improve the presentation of this work. ...
arXiv:2002.09668v1
fatcat:nhasdzb7t5dt5brs2r7ocdzrnm
Understanding and Optimizing Packed Neural Network Training for Hyper-Parameter Tuning
[article]
2021
arXiv
pre-print
(2) The benefits of the pack primitive largely depend on a number of factors, including memory capacity, chip architecture, neural network structure, batch size, and data preprocessing overlap; (3) there exists a trade-off between packing and unpacking when training multiple neural network models on limited resources; (4) a pack-aware Hyperband is up to 2.7× faster than the original Hyperband, with this improvement growing as memory size increases and ...
arXiv:2002.02885v4
fatcat:bzrbaltbzzfnvbsrelouhqkxoq
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
[article]
2020
arXiv
pre-print
As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. ...
Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable ...
For each experiment run, we begin with an SLO of 2.9 ms (1× the execution latency of batch-1 ResNet50 inference). ...
arXiv:2006.02464v2
fatcat:f7quwroge5hmxpw66a5oujqhhu
MLPerf Tiny Benchmark
[article]
2021
arXiv
pre-print
MLPerf Tiny measures the accuracy, latency, and energy of machine learning inference to properly evaluate the tradeoffs between systems. ...
Additionally, MLPerf Tiny implements a modular design that enables benchmark submitters to show the benefits of their product, regardless of where it falls on the ML deployment stack, in a fair and reproducible ...
The dataset is divided into five training batches and one testing batch, each with 10000 images. ...
arXiv:2106.07597v4
fatcat:ps4y36uq4nevxfbe7p3tne4opu
Fast convolutional neural networks on FPGAs with hls4ml
[article]
2021
arXiv
pre-print
used in trigger and data acquisition systems of particle detectors. ...
We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks with convolutional layers on FPGAs. ...
targeting Xilinx system-on-chips (SoCs). ...
arXiv:2101.05108v2
fatcat:3prsiiuypjew5lovb3kwv5gviq
Deep Learning for Mobile Multimedia
2017
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
, speech-to-text translation, media information retrieval, multi-modal data analysis, and so on. ...
bandwidth, and so on. This brings the need for more efficient DNN technologies that can cope with the constraints of mobile multimedia. ...
In the cuDNN implementation, NVIDIA focuses on on-chip memory and processing, since off-chip memory is much more expensive. The authors of [29] implement input fetching to hide the memory latency with the data ...
doi:10.1145/3092831
fatcat:ez2fcgckhjawlfywyecest4jqy
Applications and Techniques for Fast Machine Learning in Science
[article]
2021
arXiv
pre-print
This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. ...
The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for ...
KEY AREAS OF OVERLAP: Real-time, accelerated AI inference shows promise in improving the discovery potential of current and planned scientific instruments across the domains, as detailed in Sec. 2. ...
arXiv:2110.13041v1
fatcat:cvbo2hmfgfcuxi7abezypw2qrm
Applying CNN on a scientific application accelerator based on dataflow architecture
2019
CCF Transactions on High Performance Computing
The experimental results reveal that, by using our scheme, the performance of AlexNet and VGG-19 running on the SPU is on average 2.29× higher than on an NVIDIA Titan Xp, and the energy consumption of our hardware ...
However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption. ...
Without data dependencies between contexts, multiple contexts in loop-in-pipeline mode execute on the PEs in a pipelined and parallel manner. ...
doi:10.1007/s42514-019-00015-7
fatcat:4n5kyzorsfdvph3uuvaaz65chi
Accelerating Spike-by-Spike Neural Networks on FPGA with Hybrid Custom Floating-Point and Logarithmic Dot-Product Approximation
2021
IEEE Access
ACKNOWLEDGMENTS: This work is funded by the Consejo Nacional de Ciencia y Tecnología (CONACYT, the Mexican National Council for Science and Technology). ...
values as outputs; the second one is represented by more complex architectures such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) using continuous activation functions; while the ...
INTRODUCTION: The exponential improvement in computing performance and the availability of large amounts of data are boosting the use of artificial intelligence (AI) applications in our daily lives. ...
doi:10.1109/access.2021.3085216
fatcat:dxvv2cvc5zdv5hxhwe2wew2wsi
Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights
[article]
2021
arXiv
pre-print
structured sparsity can improve storage efficiency and balance computations; understanding how to compile and map models with sparse tensors on the accelerators; understanding recent design trends for ...
This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators. ...
With a boost in energy-efficient acceleration of learning and inference at the cloud and edge, they can be anticipated to further improve the intelligence of various systems and applications. ...
arXiv:2007.00864v2
fatcat:k4o2xboh4vbudadfiriiwjp7uu
GPU-Based Embedded Intelligence Architectures and Applications
2021
Electronics
This paper gives a comprehensive review and representative studies of the emerging and current paradigms for GPU-based EI, with a focus on the architecture, technologies, and applications: (1) First, an overview and classification of GPU-based EI research are presented to give the full spectrum of this area, which also serves as a concise summary of the scope of the paper; (2) Second, various architecture ...
GPUs have been utilized to function as hardware accelerators [3] in speeding up training and inference of machine learning, deep learning and AI. ...
doi:10.3390/electronics10080952
fatcat:paubm2sevbhixi2in63ayflmti
An Overview of Machine Learning within Embedded and Mobile Devices–Optimizations and Applications
2021
Sensors
Additionally, we discuss the implementation of these algorithms in microcontroller units, mobile devices, and hardware accelerators. ...
Embedded systems technology is undergoing a phase of transformation owing to the novel advancements in computer architecture and the breakthroughs in machine learning applications. ...
Memory Footprint: The available on-chip and off-chip memory in embedded systems is very limited compared to the size of ML parameters (synapses and activations) [27].
doi:10.3390/s21134412
pmid:34203119
fatcat:dxmshp4frnf4pcookdy3wjl4fi