4,524 Hits in 4.6 sec

Accelerating Deep Learning Inference via Learned Caches [article]

Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han Cao, Shivaram Venkataraman, Aditya Akella
2021 arXiv   pre-print
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference.  ...  However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches - caches that consist of simple ML models that are continuously updated.  ...  Per our knowledge, GATI is the first work to use ML models as caches to accelerate DNN inference.  ... 
arXiv:2101.07344v1 fatcat:cgpq66oh45g7zhi6ayhxhkspnq
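
The learned-cache idea in the abstract above can be sketched minimally: a small, cheap model answers a request when its confidence clears a threshold, and the request falls through to the full DNN otherwise. All names below are hypothetical illustrations under that reading, not GATI's actual design or API.

```python
# Minimal sketch of a "learned cache": a cheap proxy model serves a
# prediction when its confidence clears a threshold; otherwise the
# request falls through to the expensive base model.

def serve(x, cheap_model, base_model, threshold=0.9):
    """Return (prediction, hit) where hit=True means the cache answered."""
    label, confidence = cheap_model(x)
    if confidence >= threshold:
        return label, True          # cache hit: skip the full DNN
    return base_model(x), False     # cache miss: run the base model

# Toy stand-ins: the cheap model is confident only for "easy" inputs.
def cheap_model(x):
    return ("positive" if x > 0 else "negative", 0.95 if abs(x) > 1 else 0.5)

def base_model(x):
    return "positive" if x >= 0 else "negative"

print(serve(2.0, cheap_model, base_model))   # -> ('positive', True)
print(serve(0.3, cheap_model, base_model))   # -> ('positive', False)
```

The "continuously updated" aspect the abstract mentions would correspond to periodically retraining `cheap_model` on the base model's recent outputs.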

Does Form Follow Function? An Empirical Exploration of the Impact of Deep Neural Network Architecture Design on Hardware-Specific Acceleration [article]

Saad Abbasi, Mohammad Javad Shafiee, Ellick Chan, Alexander Wong
2021 arXiv   pre-print
Finally, we analyze the inference time reductions using hardware-specific acceleration when compared to native deep learning frameworks across a wide variety of hand-crafted deep convolutional neural network  ...  In this study, a comprehensive empirical exploration is conducted to investigate the impact of deep neural network architecture design on the degree of inference speedup that can be achieved via hardware-specific  ...  The efficient design and acceleration of deep neural networks are key for the widespread deployment of deep learning applications.  ... 
arXiv:2107.04144v1 fatcat:pqluydcz5rhpxm6pbu527nvaji

Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference

Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi, Kunlun Li
2022 Sixteenth ACM Conference on Recommender Systems  
In particular, Merlin HugeCTR combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks.  ...  Merlin HugeCTR is an open source, GPU-accelerated integration framework for click-through rate estimation.  ...  Wide & Deep Learning (WDL; [3]), Deep Cross Network (DCN; [24]), DeepFM [8], and Deep Learning Recommendation Model (DLRM; [16]).  ... 
doi:10.1145/3523227.3547405 fatcat:qirn3sub4vecnnonuudyy26yhq
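
The two-level lookup described above, a fast device-side embedding cache in front of a larger, slower storage tier, can be sketched as a toy LRU cache. Sizes and names here are illustrative assumptions, not HugeCTR's implementation.

```python
# Sketch of a two-level embedding lookup: a small "device" cache in
# front of a larger backing store, with LRU eviction on overflow.
from collections import OrderedDict

class EmbeddingCache:
    def __init__(self, backing_store, capacity=2):
        self.store = backing_store       # e.g. host memory / SSD table
        self.cache = OrderedDict()       # stands in for the GPU cache
        self.capacity = capacity
        self.misses = 0

    def lookup(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # LRU: mark as recently used
            return self.cache[key]
        self.misses += 1
        vec = self.store[key]            # slow path: fetch from backing tier
        self.cache[key] = vec
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return vec

table = {i: [0.1 * i, 0.2 * i] for i in range(100)}  # toy embedding table
cache = EmbeddingCache(table, capacity=2)
for k in [1, 2, 1, 3, 1]:        # key 1 stays hot; key 2 is evicted by 3
    cache.lookup(k)
print(cache.misses)               # -> 3  (keys 1, 2, 3 each miss once)
```

The point of the hierarchy is that hot embeddings (a small fraction of an ultra-sparse table) absorb most lookups, so the slow tier is touched rarely.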

2021 Index IEEE Transactions on Parallel and Distributed Systems Vol. 32

2022 IEEE Transactions on Parallel and Distributed Systems  
Balaji, P., +, TPDS July 2021 1511-1512. Deep learning (artificial intelligence): Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs.  ...  ., +, TPDS July 2021 1765-1776. Petrel: Heterogeneity-Aware Distributed Deep Learning Via Hybrid Synchronization.  ... 
doi:10.1109/tpds.2021.3107121 fatcat:e7bh2xssazdrjcpgn64mqh4hb4

DeepCache: Principled Cache for Mobile Deep Vision
Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, Xuanzhe Liu
2018 Proceedings of the 24th Annual International Conference on Mobile Computing and Networking - MobiCom '18  
We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision.  ...  Our implementation of DeepCache works with unmodified deep learning models, requires no manual effort from developers, and is therefore immediately deployable on off-the-shelf mobile devices.  ...  DeepCache works as a lightweight extension to a commodity deep learning inference engine.  ... 
doi:10.1145/3241539.3241563 dblp:conf/mobicom/XuZLLL18 fatcat:w4pwzh3trjeb5ealpwq4i6c4ea
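
The reuse principle behind a continuous-vision cache can be sketched as follows: if a region of the current frame closely matches the previous frame, the cached result for that region is reused instead of being recomputed. This is an illustration of the general idea under assumed names, not DeepCache's actual matching algorithm.

```python
# Sketch of result reuse across consecutive frames: blocks that are
# nearly unchanged reuse the previous frame's cached result; only
# changed blocks pay for the expensive computation.

def reuse_mask(prev_frame, curr_frame, threshold=0.05):
    """Per-block decision: True = block is unchanged, reuse cached result."""
    return [abs(a - b) <= threshold for a, b in zip(prev_frame, curr_frame)]

def infer(frame, prev_frame, cached, expensive_op):
    mask = reuse_mask(prev_frame, frame)
    out = []
    for i, reuse in enumerate(mask):
        if reuse and cached[i] is not None:
            out.append(cached[i])               # cache hit: skip recompute
        else:
            out.append(expensive_op(frame[i]))  # recompute changed block
    return out, sum(mask)

prev = [0.1, 0.5, 0.9]
curr = [0.1, 0.8, 0.9]           # only the middle block changed
cached = [1.0, 2.0, 3.0]         # results computed for the previous frame
out, hits = infer(curr, prev, cached, lambda x: round(x * 10, 1))
print(out, hits)                  # -> [1.0, 8.0, 3.0] 2
```

A real system would match blocks spatially (video frames shift as the camera moves) rather than comparing them in place as this toy does.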

C++ Code Generation for Fast Inference of Deep Learning Models in ROOT/TMVA

Sitong An, Lorenzo Moneta, C. Biscarat, S. Campana, B. Hegner, S. Roiser, C.I. Rovelli, G.A. Stewart
2021 EPJ Web of Conferences  
We report the latest development in ROOT/TMVA, a new system that takes trained ONNX deep learning models and emits C++ code that can be easily included and invoked for fast inference of the model, with  ...  We present an overview of the current solutions for conducting inference in a C++ production environment, discuss the technical details and examples of the generated code, and demonstrate its development  ...  ONNX, an open source project aiming to accelerate deep learning inference across different frameworks, operating systems and hardware platforms, has been developed with the support of Microsoft  ... 
doi:10.1051/epjconf/202125103040 fatcat:ogad6l6z5vc5dc65qu7noocrhm
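
The code-generation approach described above can be illustrated with a toy emitter: given trained weights, it writes a standalone C++ function with those weights baked in as constants, so inference needs no framework at runtime. The function and layout below are hypothetical; the actual TMVA generator handles full ONNX graphs.

```python
# Toy code generator: emit C++ for a dense layer y = W x + b with the
# trained weights hard-coded as static arrays.

def emit_dense_layer(name, weights, bias):
    """Emit a C++ function computing y = W x + b for the given weights."""
    rows, cols = len(weights), len(weights[0])
    w_flat = ", ".join(str(v) for row in weights for v in row)
    b_flat = ", ".join(str(v) for v in bias)
    return (
        f"void {name}(const float* x, float* y) {{\n"
        f"  static const float W[{rows * cols}] = {{{w_flat}}};\n"
        f"  static const float b[{rows}] = {{{b_flat}}};\n"
        f"  for (int i = 0; i < {rows}; ++i) {{\n"
        f"    y[i] = b[i];\n"
        f"    for (int j = 0; j < {cols}; ++j)\n"
        f"      y[i] += W[i * {cols} + j] * x[j];\n"
        f"  }}\n"
        f"}}\n"
    )

print(emit_dense_layer("dense0", [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]))
```

Because the weights are compile-time constants, the C++ compiler can unroll and vectorize the loops, which is one motivation for generating code rather than interpreting a graph.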

Special Issue on Artificial-Intelligence-Powered Edge Computing for Internet of Things

Lei Yang, Xu Chen, Samir M. Perlaza, Junshan Zhang
2020 IEEE Internet of Things Journal  
In the article "AWARE-CNN: Automated workflow for application-aware real-time edge acceleration of CNNs," Sanchez et al. presented AWARE-CNN accelerators for real-time execution of deep learning algorithms  ...  in mobile-edge computing servers to minimize the end-to-end inference delay of deep learning tasks in multiple DNN partitions.  ... 
doi:10.1109/jiot.2020.3019948 fatcat:mogalqnhnnaqpbxb7zivzdhvry

2021 Index IEEE Computer Architecture Letters Vol. 20

2022 IEEE computer architecture letters  
., Hardware Acceleration for GCNs via Bidirectional Fusion; LCA Jan.  ...  ., +, LCA July-Dec. 2021 171-174. Hardware: Learned Performance Model for SSD. Lee, H.G., +, LCA July-Dec. Inference Accelerators.  ... 
doi:10.1109/lca.2022.3141948 fatcat:4q3mjjqd4zdazg57yfor4ym2gi

Edge Intelligence: Architectures, Challenges, and Applications [article]

Dianlei Xu, Tong Li, Yong Li, Xiang Su, Sasu Tarkoma, Tao Jiang, Jon Crowcroft, Pan Hui
2020 arXiv   pre-print
We first identify four fundamental components of edge intelligence, namely edge caching, edge training, edge inference, and edge offloading, based on theoretical and practical results pertaining to proposed  ...  Edge intelligence refers to a set of connected systems and devices for artificial-intelligence-based data collection, caching, processing, and analysis in locations close to where the data is captured.  ...  Caching based on such redundancy could effectively reduce computation and accelerate the inference.  ... 
arXiv:2003.12172v2 fatcat:xbrylsvb7bey5idirunacux6pe

Benchmarking Modern Edge Devices for AI Applications

Pilsung KANG, Jongmin JO
2021 IEICE transactions on information and systems  
We perform a set of deep learning benchmarks on the devices to measure their performance.  ...  By comparing the performance with other GPU (graphics processing unit) accelerated systems in different platforms, we assess the computational capability of the modern edge devices featuring a significant  ...  [30] for a comprehensive review on accelerator architectures for deep learning applications.  ... 
doi:10.1587/transinf.2020edp7160 fatcat:4uo7pehd7vbylckgmpoh5s34im

In-Edge AI: Intelligentizing Mobile Edge Computing, Caching and Communication by Federated Learning [article]

Xiaofei Wang, Yiwen Han, Chenyang Wang, Qiyang Zhao, Xu Chen, Min Chen
2019 arXiv   pre-print
Learning techniques and the Federated Learning framework with the mobile edge systems, for optimizing mobile edge computing, caching, and communication.  ...  In order to bring more intelligence to the edge systems, compared to traditional optimization methodology, and driven by current deep learning techniques, we propose to integrate the Deep Reinforcement  ...  Particularly, related studies on edge computing and caching, such as [7] [8], have shown that reinforcement learning [9] (including Deep Q-Learning [10] in Deep Reinforcement Learning) has the potential  ... 
arXiv:1809.07857v2 fatcat:2sav5fnozbc3rd2atcgpxzg7jq

Cross-Stack Workload Characterization of Deep Recommendation Systems [article]

Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, David Brooks
2020 arXiv   pre-print
Deep learning based recommendation systems form the backbone of most personalized cloud services.  ...  Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches - ranging from near memory  ...  The main contributions of this paper are: 1) While conventional wisdom and recent research indicate that AI accelerators, such as GPUs, readily accelerate deep learning workloads, this work shows the  ... 
arXiv:2010.05037v1 fatcat:pe7epaic6fgnbarw3zjmnnalsi

JIZHI: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu [article]

Hao Liu, Qian Gao, Jiang Li, Xiaochao Liao, Hao Xiong, Guangxing Chen, Wenlin Wang, Guobao Yang, Zhiwei Zha, Daxiang Dong, Dejing Dou, Haoyi Xiong
2021 arXiv   pre-print
In modern internet industries, deep learning based recommender systems have become an indispensable building block for a wide spectrum of applications, such as search engines, news feeds, and short video  ...  Besides, JIZHI introduces heterogeneous and hierarchical storage to further accelerate the online inference process by reducing unnecessary computations and potential data access latency induced by ultra-sparse  ...  Serving deep learning based online inference.  ... 
arXiv:2106.01674v1 fatcat:zogjjhrfezc3pkboekw56ebf5y

Boosting Mobile CNN Inference through Semantic Memory [article]

Yun Li, Chen Zhang, Shihao Han, Li Lyna Zhang, Baoqun Yin, Yunxin Liu, Mengwei Xu
2021 arXiv   pre-print
Extensive experiments on large-scale datasets and models show that SMTM can significantly speed up model inference over the standard approach (up to 2X) and prior cache designs (up to 1.5X), with acceptable  ...  For the first time, we borrow and distill such a capability into a semantic memory design, namely SMTM, to improve on-device CNN inference.  ... 
arXiv:2112.02644v1 fatcat:gfyecsojvzgaxjgmlwihov26ju
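
The family of techniques this entry builds on, exiting a network early when an intermediate classifier is already confident, can be sketched as a chain of stages. The stage structure and names below are illustrative assumptions, not SMTM's design.

```python
# Sketch of early-exit inference: intermediate classifiers let "easy"
# inputs leave the network before the remaining stages run.

def early_exit_infer(x, stages, threshold=0.9):
    """Run stages in order; return (label, depth) at the first confident exit.
    Each stage maps x -> (x_next, label, confidence)."""
    for depth, stage in enumerate(stages, start=1):
        x, label, conf = stage(x)
        if conf >= threshold:
            return label, depth      # exit early: skip remaining stages
    return label, depth              # fall through to the final stage

# Toy stages: confidence grows as the input passes through more stages.
def make_stage(gain):
    return lambda x: (x * 2, "cat" if x > 0 else "dog", min(1.0, abs(x) * gain))

stages = [make_stage(0.25), make_stage(0.5), make_stage(2.0)]
print(early_exit_infer(4.0, stages))   # easy input, exits at stage 1
print(early_exit_infer(0.5, stages))   # harder input, needs all 3 stages
```

The latency win comes from the exit depth: inputs that leave at stage 1 skip the cost of every later stage, so average inference time drops as long as the intermediate classifiers are cheap.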

A Hardware-Software Blueprint for Flexible Deep Learning Specialization [article]

Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
2019 arXiv   pre-print
Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility  ...  We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads.  ...  We present VTA, a parametrizable deep learning architecture that is explicitly programmed via a two-level ISA.  ... 
arXiv:1807.04188v3 fatcat:wpafekkrqzffzfe7vulaa6qnva