108 Hits in 7.0 sec

A Heterogeneous RISC-V Processor for Efficient DNN Application in Smart Sensing System

Haifeng Zhang, Xiaoti Wu, Yuyu Du, Hongqing Guo, Chuxi Li, Yidong Yuan, Meng Zhang, Shengbing Zhang
2021 Sensors  
Accelerating the compute-intensive DNN inference is, therefore, of utmost importance.  ...  As the physical limitation of sensing devices, the design of processor needs to meet the balanced performance metrics, including low power consumption, low latency, and flexible configuration.  ...  Acknowledgments: We thank Hui Qiang, Xin Li and Jiaying Yang for their assistance in providing the experimental data. Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/s21196491 pmid:34640811 fatcat:mbr2d5mggrhhje24dvgdzo4yqu

HyGCN: A GCN Accelerator with Hybrid Architecture [article]

Mingyu Yan, Lei Deng, Xing Hu, Ling Liang, Yujing Feng, Xiaochun Ye, Zhimin Zhang, Dongrui Fan, Yuan Xie
2020 arXiv   pre-print
Third, we optimize the overall system via inter-engine pipeline for inter-phase fusion and priority-based off-chip memory access coordination to improve off-chip bandwidth utilization.  ...  Compared to the state-of-the-art software framework running on Intel Xeon CPU and NVIDIA V100 GPU, our work achieves on average 1509× speedup with 2500× energy reduction and average 6.5× speedup with 10  ...  Acknowledgments We thank the anonymous reviewers of HPCA 2020 and the sealer in Scalable Energy-efficient Architecture Lab (SEAL) for their constructive and insightful comments.  ... 
arXiv:2001.02514v1 fatcat:uts223fpivefhh4lmrcyg7asuy

A Survey on Graph Processing Accelerators: Challenges and Opportunities [article]

Chuangyi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xinyu Chen, Xiaofei Liao, Hai Jin
2019 arXiv   pre-print
Despite a wealth of existing efforts on developing graph processing systems for improving the performance and/or energy efficiency on traditional architectures, dedicated hardware solutions, also referred  ...  Specifically, we review the relevant techniques in three core components toward a graph processing accelerator: preprocessing, parallel graph computation and runtime scheduling.  ...  To support large-scale of graphs, hybrid CPU-GPU systems [64, 65], multi-GPUs systems [19, 66] and out-of-memory systems [67, 68] have been proposed. Remarks.  ... 
arXiv:1902.10130v1 fatcat:p5lzlf3gubckfpu4eowgo4myi4

Communication-Efficient Edge AI: Algorithms and Systems [article]

Yuanming Shi, Kai Yang, Tao Jiang, Jun Zhang, Khaled B. Letaief
2020 arXiv   pre-print
By pushing inference and training processes of AI models to edge nodes, edge AI has emerged as a promising alternative.  ...  We then introduce communication-efficient techniques, from both algorithmic and system perspectives for training and inference tasks at the network edge.  ...  Zhi Ding from the University of California at Davis for insightful and constructive comments to improve the presentation of this work.  ... 
arXiv:2002.09668v1 fatcat:nhasdzb7t5dt5brs2r7ocdzrnm

Understanding and Optimizing Packed Neural Network Training for Hyper-Parameter Tuning [article]

Rui Liu, Sanjay Krishnan, Aaron J. Elmore, Michael J. Franklin
2021 arXiv   pre-print
of the pack primitive largely depends on a number of factors including memory capacity, chip architecture, neural network structure, and batch size; (3) there exists a trade-off between packing and unpacking  ...  when training multiple neural network models on limited resources; (4) a pack-aware Hyperband is up to 2.7x faster than the original Hyperband, with this improvement growing as memory size increases and  ...  (2) The benefits of the pack primitive largely depend on a number of factors including memory capacity, chip architecture, neural network structure, batch size, and data preprocessing overlap. (3) There  ... 
arXiv:2002.02885v4 fatcat:bzrbaltbzzfnvbsrelouhqkxoq

Serving DNNs like Clockwork: Performance Predictability from the Bottom Up [article]

Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
2020 arXiv   pre-print
As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets.  ...  Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable  ...  For each experiment run, we begin with an SLO of 2.9 ms (1× the execution latency of batch-1 ResNet50 inference).  ... 
arXiv:2006.02464v2 fatcat:f7quwroge5hmxpw66a5oujqhhu

MLPerf Tiny Benchmark [article]

Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David Kanter, Sebastian Ahmed, Danilo Pau, Urmish Thakker, Antonio Torrini (+10 others)
2021 arXiv   pre-print
MLPerf Tiny measures the accuracy, latency, and energy of machine learning inference to properly evaluate the tradeoffs between systems.  ...  Additionally, MLPerf Tiny implements a modular design that enables benchmark submitters to show the benefits of their product, regardless of where it falls on the ML deployment stack, in a fair and reproducible  ...  The dataset is divided into five training batches and one testing batch, each with 10000 images.  ... 
arXiv:2106.07597v4 fatcat:ps4y36uq4nevxfbe7p3tne4opu

Fast convolutional neural networks on FPGAs with hls4ml [article]

Thea Aarrestad, Vladimir Loncar, Nicolò Ghielmetti, Maurizio Pierini, Sioni Summers, Jennifer Ngadiuba, Christoffer Petersson, Hampus Linander, Yutaro Iiyama, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris (+8 others)
2021 arXiv   pre-print
used in trigger and data acquisition systems of particle detectors.  ...  We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks with convolutional layers on FPGAs.  ...  targeting Xilinx system-on-chips (SoCs).  ... 
arXiv:2101.05108v2 fatcat:3prsiiuypjew5lovb3kwv5gviq

Deep Learning for Mobile Multimedia

Kaoru Ota, Minh Son Dao, Vasileios Mezaris, Francesco G. B. De Natale
2017 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
, speech-to-text translation, media information retrieval, multi-modal data analysis, and so on.  ...  bandwidth, and so on. This brings the need of more efficient DNN technologies, which can cope with the constraints of mobile multimedia.  ...  In a cuDNN implementation, NVIDIA focuses on on-chip memory and processing since off-chip memory is much expensive. The authors of [29] implement input fetching to hide the memory latency with the data  ... 
doi:10.1145/3092831 fatcat:ez2fcgckhjawlfywyecest4jqy

Applications and Techniques for Fast Machine Learning in Science [article]

Allison McCarn Deiana, Joshua Agar, Michaela Blott, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Scott Hauck, Mia Liu, Mark S. Neubauer, Jennifer Ngadiuba, Seda Ogrenci-Memik, Maurizio Pierini (+74 others)
2021 arXiv   pre-print
This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions.  ...  The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for  ...  KEY AREAS OF OVERLAP Real-time, accelerated AI inference show promises in improving the discovery potential at current and planned scientific instruments across the domains as detailed in Sec. 2.  ... 
arXiv:2110.13041v1 fatcat:cvbo2hmfgfcuxi7abezypw2qrm

Applying CNN on a scientific application accelerator based on dataflow architecture

Xiaochun Ye, Taoran Xiang, Xu Tan, Yujing Feng, Haibin Wu, Meng Wu, Dongrui Fan
2019 CCF Transactions on High Performance Computing  
The experiment results reveal that by using our scheme, the performance of AlexNet and VGG-19 running on SPU is averagely 2.29× higher than that on NVIDIA Titan Xp, and the energy consumption of our hardware  ...  However, accelerators which are implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption.  ...  Without data dependency between contexts, multiple contexts in loop-in-pipeline mode execute on the PEs in a pipelined and paralleled manner.  ... 
doi:10.1007/s42514-019-00015-7 fatcat:4n5kyzorsfdvph3uuvaaz65chi

Accelerating Spike-by-Spike Neural Networks on FPGA with Hybrid Custom Floating-Point and Logarithmic Dot-Product Approximation

Yarib Nevarez, David Rotermund, Klaus R. Pawelzik, Alberto Garcia-Ortiz
2021 IEEE Access  
ACKNOWLEDGMENTS This work is funded by the Consejo Nacional de Ciencia y Tecnologia - CONACYT (the Mexican National Council for Science and Technology).  ...  values as outputs; the second one is represented by more complex architectures as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) using continuous activation functions; while the  ...  INTRODUCTION The exponential improvement in computing performance and the availability of large amounts of data are boosting the use of artificial intelligence (AI) applications in our daily lives.  ... 
doi:10.1109/access.2021.3085216 fatcat:dxvv2cvc5zdv5hxhwe2wew2wsi

Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights [article]

Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, Baoxin Li
2021 arXiv   pre-print
structured sparsity can improve storage efficiency and balance computations; understanding how to compile and map models with sparse tensors on the accelerators; understanding recent design trends for  ...  This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators.  ...  With a boost in energy-efficient accelerations of the learning and inference at the cloud and edge, they can be anticipated to further improve the intelligence of various systems or applications.  ... 
arXiv:2007.00864v2 fatcat:k4o2xboh4vbudadfiriiwjp7uu

GPU-Based Embedded Intelligence Architectures and Applications

Li Minn Ang, Kah Phooi Seng
2021 Electronics  
This paper gives a comprehensive review and representative studies of the emerging and current paradigms for GPU-based EI with the focus on the architecture, technologies and applications: (1) First, the  ...  overview and classifications of GPU-based EI research are presented to give the full spectrum in this area that also serves as a concise summary of the scope of the paper; (2) Second, various architecture  ...  GPUs have been utilized to function as hardware accelerators [3] in speeding up training and inference of machine learning, deep learning and AI.  ... 
doi:10.3390/electronics10080952 fatcat:paubm2sevbhixi2in63ayflmti

An Overview of Machine Learning within Embedded and Mobile Devices–Optimizations and Applications

Taiwo Samuel Ajani, Agbotiname Lucky Imoize, Aderemi A. Atayero
2021 Sensors  
Additionally, we discuss the implementation of these algorithms in microcontroller units, mobile devices, and hardware accelerators.  ...  Embedded systems technology is undergoing a phase of transformation owing to the novel advancements in computer architecture and the breakthroughs in machine learning applications.  ...  Memory Footprint The available on-chip and off-chip memory in embedded systems are very limited compared to the size of ML parameters (synapses and activations) [27].  ... 
doi:10.3390/s21134412 pmid:34203119 fatcat:dxmshp4frnf4pcookdy3wjl4fi