Run-Time Efficient RNN Compression for Inference on Edge Devices [article]

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina
2019 arXiv   pre-print
As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy.  ...  Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities, while still having run-time  ...  Thus, there is a need for good compression techniques to enable large RNN models to fit into an edge device or ensure that they run efficiently on devices with smaller caches [16].  ...
arXiv:1906.04886v2 fatcat:rwdjt6zs2bgz5oobaxiloz7seu

RNNAccel: A Fusion Recurrent Neural Network Accelerator for Edge Intelligence [article]

Chao-Yang Kao, Huang-Chih Kuo, Jian-Wen Chen, Chiung-Liang Lin, Pin-Han Chen, Youn-Long Lin
2020 arXiv   pre-print
Many edge devices employ Recurrent Neural Networks (RNN) to enhance their product intelligence.  ...  However, the increasing computation complexity poses challenges for performance, energy efficiency, and product development time.  ...  Weight Coefficient Compression: Energy efficiency is an important metric for edge devices. A previous study [14] shows that memory access usually dominates edge AI inference power consumption.  ...
arXiv:2010.13311v1 fatcat:xgsm7qbifvesbih4o3rbrnpo4a

Rank and run-time aware compression of NLP Applications [article]

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina
2020 arXiv   pre-print
As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy.  ...  Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints.  ...  This paper focuses on inference on an edge device.  ... 
arXiv:2010.03193v1 fatcat:hwftkhffl5b4tnigr5ugfla4ty

EdgeDRNN: Recurrent Neural Network Accelerator for Edge Inference

Chang Gao, Antonio Rios-Navarro, Xi Chen, Shih-Chii Liu, Tobi Delbruck
2020 IEEE Journal on Emerging and Selected Topics in Circuits and Systems  
We propose a lightweight Gated Recurrent Unit (GRU)-based RNN accelerator called EdgeDRNN that is optimized for low-latency edge RNN inference with a batch size of 1.  ...  Low-latency, low-power portable recurrent neural network (RNN) accelerators offer powerful inference capabilities for real-time applications such as IoT, robotics, and human-machine interaction.  ...  Because of our interest in real-time edge applications, our focus is on supporting low-latency batch-1 inference of large RNNs for real-time performance on low-cost and low-power but very constrained edge  ...
doi:10.1109/jetcas.2020.3040300 fatcat:6po265l6rrh4zd35ymyzvj3mce
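
To make the batch-1 workload concrete, here is a minimal PyTorch sketch of frame-by-frame GRU inference with a batch size of 1, the access pattern EdgeDRNN-style accelerators target; the layer sizes and input stream below are illustrative placeholders, not values from the paper.

```python
import torch

# Minimal batch-1 GRU inference loop: one hidden-state update per input
# frame. Layer sizes and the input stream are illustrative, not the paper's.
cell = torch.nn.GRUCell(input_size=64, hidden_size=256)
h = torch.zeros(1, 256)              # hidden state, batch size of 1
stream = torch.randn(100, 1, 64)     # 100 input frames arriving one at a time
with torch.inference_mode():
    for x in stream:                 # process frames sequentially
        h = cell(x, h)               # low latency, no batching across requests
```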

Edge Intelligence: Paving the Last Mile of Artificial Intelligence With Edge Computing

Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, Junshan Zhang
2019 Proceedings of the IEEE  
To this end, we conduct a comprehensive survey of the recent research efforts on EI. Specifically, we first review the background and motivation for AI running at the network edge.  ...  We then provide an overview of the overarching architectures, frameworks, and emerging key technologies for deep learning models toward training/inference at the network edge.  ...  Based on this observation, Lin et al. propose deep gradient compression (DGC), which compresses the gradient by 270-600× for a wide range of CNNs and RNNs.  ...
doi:10.1109/jproc.2019.2918951 fatcat:d53vxmklgfazbmzjhsq3tuoama
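
The core of gradient-sparsification schemes like DGC is transmitting only the largest-magnitude gradient entries. A minimal sketch of that top-k step follows; note that DGC itself also uses momentum correction and local accumulation of the untransmitted residual, which this omits.

```python
import torch

def topk_compress(grad, ratio=0.01):
    # Keep only the largest-magnitude 1% of entries (99% sparsity).
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    return idx, flat[idx]

def topk_decompress(idx, vals, shape):
    # Scatter the transmitted entries back into a dense zero tensor.
    out = torch.zeros(torch.Size(shape).numel())
    out[idx] = vals
    return out.view(shape)

g = torch.randn(256, 512)                    # a toy gradient tensor
idx, vals = topk_compress(g)                 # what a worker would transmit
g_hat = topk_decompress(idx, vals, g.shape)  # what the receiver reconstructs
```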

Compression of recurrent neural networks for efficient language modeling

Artem M. Grachev, Dmitry I. Ignatov, Andrey V. Savchenko
2019 Applied Soft Computing  
We focus on effective compression methods in the context of their use on devices: pruning, quantization, and matrix decomposition approaches (low-rank factorization and tensor train decomposition)  ...  However, in practice their memory and run-time complexity are usually too large to be implemented in real-time offline mobile applications.  ...  The authors would like to thank Dmitriy Polubotko for his valuable help with the experiments on mobile devices.  ...
doi:10.1016/j.asoc.2019.03.057 fatcat:msw6p77rlfamxc2xwvbl7eyk6q
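
Of the methods this paper surveys, low-rank factorization is the easiest to illustrate: a weight matrix W is replaced by two thin factors whose product approximates it. A minimal truncated-SVD sketch follows; matrix sizes and the rank are illustrative only.

```python
import torch

def low_rank_factorize(W, rank):
    # Approximate W (m x n) by A @ B with A (m x r) and B (r x n); when
    # r << min(m, n) this shrinks both storage and multiply cost.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=64)
print(W.numel() / (A.numel() + B.numel()))   # parameter compression: 8.0x
```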

Machine Learning at the Network Edge: A Survey [article]

M.G. Sarwar Murshed, Christopher Murphy, Daqing Hou, Nazar Khan, Ganesh Ananthanarayanan, Faraz Hussain
2021 arXiv   pre-print
This has led to the generation of large quantities of data in real time, which is an appealing target for AI systems.  ...  This survey describes major research efforts where machine learning systems have been deployed at the edge of computer networks, focusing on the operational aspects including compression techniques, tools  ...  The SparkFun Edge is a real-time audio analysis device, which runs ML inference to detect a keyword, for example, "yes", and responds accordingly.  ...
arXiv:1908.00080v4 fatcat:mw4lwwvzf5gupjr6pgdgnabeuu

Efficient CNN-LSTM based Image Captioning using Neural Network Compression [article]

Harshit Rampal, Aman Mohanty
2020 arXiv   pre-print
However, they are notorious for their voracious memory and compute appetite, which further obstructs their deployment on resource-limited edge devices.  ...  We then examine the effects of different compression architectures on the model and design a compression architecture that achieves a 73.1% reduction in model size, 71.3% reduction in inference time and  ...  The compressed model achieves impressive storage efficiency coupled with a respectable reduction in inference time.  ...
arXiv:2012.09708v1 fatcat:uctapqsspfe25j3jgh4y52c7ta
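
As one concrete example of the kind of compression step such a pipeline can include, PyTorch's post-training dynamic quantization converts recurrent and linear layer weights to int8. This is a generic sketch, not the paper's exact recipe, and the layer sizes are placeholders.

```python
import torch

# Post-training dynamic quantization: weights stored as int8, activations
# quantized on the fly at inference time. Sizes are illustrative only.
model = torch.nn.LSTM(input_size=512, hidden_size=512, num_layers=2)
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.LSTM}, dtype=torch.qint8)
```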

Table of contents

2019 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2)  
EMC2 2019 Table of Contents: Preface (vii); Review Committee (viii); Invited Talk Abstracts (ix); Technical Papers: Run-Time Efficient RNN Compression for Inference on Edge Devices  ...  Merging MobileNets for Efficient Multitask Inference (16), Cheng-En Wu, Yi-Ming Chan, and Chu-Song Chen (Academia Sinica); Integrating NVIDIA Deep Learning Accelerator  ...
doi:10.1109/emc249363.2019.00004 fatcat:crezgdiz3ze5dor3l4i4odnlem

Efficient Deep Learning Inference Based on Model Compression

Qing Zhang, Mengru Zhang, Mengdi Wang, Wanchen Sui, Chen Meng, Jun Yang, Weidan Kong, Xiaoyuan Cui, Wei Lin
2018 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)  
Along the evolution of deep learning (DL) methods, the computational complexity and resource consumption of DL models continue to increase, which makes efficient deployment challenging, especially in devices  ...  We use different modeling scenarios to test our inference optimization pipeline with the above-mentioned methods, and it shows promising results in making inference more efficient with marginal loss of model  ...  We used the practical SR as an additional indicator of inference efficiency, since memory access and movement time are not considered in FLOPs.  ...
doi:10.1109/cvprw.2018.00221 dblp:conf/cvpr/ZhangZWSMYKCL18 fatcat:l3tchfxwkratvogotbhyt7jdoi
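
The point about FLOPs is worth making concrete: because FLOP counts ignore memory access and data movement, the measured speedup ratio can diverge from the theoretical one. Below is a minimal wall-clock comparison sketch; the models and sizes are illustrative, not the paper's benchmarks.

```python
import time
import torch

def latency(model, x, iters=100):
    # Measured wall-clock time captures memory traffic that FLOP counts miss.
    with torch.inference_mode():
        model(x)                         # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - t0) / iters

dense = torch.nn.Linear(4096, 4096)      # 4x the FLOPs of `small`
small = torch.nn.Linear(4096, 1024)
x = torch.randn(1, 4096)
# The practical speedup ratio often differs from the 4x FLOP reduction.
print(latency(dense, x) / latency(small, x))
```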

C-NMT: A Collaborative Inference Framework for Neural Machine Translation [article]

Yukai Chen, Roberta Chiaro, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari
2022 arXiv   pre-print
Collaborative Inference (CI) optimizes the latency and energy consumption of deep learning inference through the inter-operation of edge and cloud devices.  ...  Albeit beneficial for other tasks, CI has never been applied to the sequence-to-sequence mapping problem at the heart of Neural Machine Translation (NMT).  ...  Both devices run Linux and perform inference with PyTorch.  ...
arXiv:2204.04043v1 fatcat:weo2rv6whfg5timsbyfkh3ewuy
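
A hypothetical sketch of the edge/cloud dispatch idea behind collaborative inference: short inputs stay on the small edge model and long ones are offloaded to the larger cloud model. C-NMT's actual policy predicts execution time from input/output sequence lengths; the threshold rule and stand-in models below are illustrative only.

```python
import torch

edge_model = torch.nn.Linear(16, 16)    # stand-in for a small on-device model
cloud_model = torch.nn.Linear(16, 16)   # stand-in for a large cloud-side model

def run(tokens, threshold=32):
    # Short inputs stay on the edge; long ones are offloaded to the cloud.
    model = edge_model if tokens.shape[0] <= threshold else cloud_model
    with torch.inference_mode():
        return model(tokens)

run(torch.randn(8, 16))     # 8 "tokens": handled on the edge
run(torch.randn(64, 16))    # 64 "tokens": offloaded to the cloud
```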

Edge Intelligence: Architectures, Challenges, and Applications [article]

Dianlei Xu, Tong Li, Yong Li, Xiang Su, Sasu Tarkoma, Tao Jiang, Jon Crowcroft, Pan Hui
2020 arXiv   pre-print
Edge intelligence refers to a set of connected systems and devices for data collection, caching, processing, and analysis, based on artificial intelligence and located close to where the data is captured.  ...  We first identify four fundamental components of edge intelligence, namely edge caching, edge training, edge inference, and edge offloading, based on theoretical and practical results pertaining to proposed  ...  The main idea of model acceleration in inference is to reduce the run-time of inference on edge devices and realise real-time responses for specific neural-network-based applications without changing the  ...
arXiv:2003.12172v2 fatcat:xbrylsvb7bey5idirunacux6pe

Hello Edge: Keyword Spotting on Microcontrollers [article]

Yundong Zhang, Naveen Suda, Liangzhen Lai, Vikas Chandra
2018 arXiv   pre-print
Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience.  ...  In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers.  ...  We would also like to thank Pete Warden from Google's TensorFlow team for his valuable inputs and feedback on this project.  ...
arXiv:1711.07128v3 fatcat:swrltzaqc5hvjay7ofrx3r4lwy

Collaborative learning between cloud and end devices

Yan Lu, Yuanchao Shu, Xu Tan, Yunxin Liu, Mengyu Zhou, Qi Chen, Dan Pei
2019 Proceedings of the 4th ACM/IEEE Symposium on Edge Computing - SEC '19  
Colla finds a middle ground to build a tailored model for each device, leveraging local data and computation resources to update the model, while at the same time exploiting the cloud to aggregate and transfer  ...  Our experiments also validate the efficiency of Colla, showing that one overnight training session on a commodity smartphone can process one year of data from a typical smartphone, at the cost of 2000 mWh and a few  ...  Model Compression and Grouping: To reduce the overhead of running model inference on resource-limited end devices and to account for the behavioral diversity of different groups of users, Colla uses model compression  ...
doi:10.1145/3318216.3363304 dblp:conf/edge/LuSTLZCP19 fatcat:caowfqxmazfypn46rvw4em4uiy

Amortized Neural Networks for Low-Latency Speech Recognition [article]

Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow
2021 arXiv   pre-print
We present results using both architectures on LibriSpeech data and show that our proposed architecture can reduce inference cost by up to 45% and latency to nearly real-time without incurring a loss  ...  The AmNets RNN-T architecture enables the network to dynamically switch between encoder branches on a frame-by-frame basis.  ...  Note that when designing a model for on-device inference, one would know the compute performance a priori and would instead design encoder branch architectures based on device capabilities.  ...
arXiv:2108.01553v1 fatcat:uci5hioqhbenvmjqmkjit624wa
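
A hypothetical sketch of the frame-by-frame branch-switching idea: a tiny arbitrator routes each frame to a cheap or an expensive encoder branch. The real AmNets branch architectures, gating mechanism, and training procedure differ; everything named below is illustrative.

```python
import torch

class TwoBranchEncoder(torch.nn.Module):
    # Hypothetical two-branch encoder: an arbitrator routes each frame to a
    # cheap or a costly branch, trading accuracy for compute per frame.
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.cheap = torch.nn.Linear(dim, hidden)
        self.costly = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden))
        self.arbitrator = torch.nn.Linear(dim, 1)

    def forward(self, frames):                      # frames: (T, dim)
        outs = []
        for x in frames:
            if torch.sigmoid(self.arbitrator(x)) < 0.5:
                outs.append(self.cheap(x))          # spend less compute
            else:
                outs.append(self.costly(x))         # spend more compute
        return torch.stack(outs)

enc = TwoBranchEncoder()
y = enc(torch.randn(50, 80))    # 50 frames, each routed independently
```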