154 Hits in 5.6 sec

LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference [article]

Erwei Wang, James J. Davis, Peter Y. K. Cheung, George A. Constantinides
2020 arXiv   pre-print
Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference  ...  For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable  ...  During inference, the most common (and expensive) computational node in a DNN performs a function of the form in (1), calculating a channel output y.  ... 
arXiv:1910.12625v2 fatcat:hyrkn7hmuzfh5g4xpivwg46oje
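The LUTNet snippet refers to the canonical node computation of its Eq. (1): a channel output produced by a weighted sum followed by an activation. A minimal pure-Python sketch of that node (the function and parameter names are illustrative, not taken from the paper):

```python
import math

def channel_output(weights, inputs, activation=math.tanh):
    """Compute y = phi(sum_n w_n * x_n): the multiply-accumulate-then-
    activate node that dominates DNN inference cost, and the function
    LUTNet re-maps onto native FPGA LUTs instead of fixed multipliers."""
    acc = sum(w * x for w, x in zip(weights, inputs))
    return activation(acc)
```

Pruning zeroes out many of the w_n; the paper's claim is that LUT-based nodes tolerate far heavier pruning than fixed multiply-accumulate arrays.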

Learned Hardware/Software Co-Design of Neural Accelerators [article]

Zhan Shi, Chirag Sakhuja, Milad Hashemi, Kevin Swersky, Calvin Lin
2020 arXiv   pre-print
This paper instead casts the problem as hardware/software co-design, with the goal of automatically identifying desirable points in the joint design space.  ...  Because the design space of deep learning software stacks and hardware accelerators is diverse and vast, prior work considers software optimizations separately from hardware architectures, effectively  ...  a nested approach for co-optimizing hardware/software parameters.  ... 
arXiv:2010.02075v1 fatcat:fbri7ktvinhdraq6krswlealdi

SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference [article]

Jude Haris, Perry Gibson, José Cano, Nicolas Bohm Agostini, David Kaeli
2021 arXiv   pre-print
In this paper we propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized DNN inference accelerators on edge devices with FPGAs.  ...  We quickly and iteratively explore the system's hardware/software stack, while identifying and mitigating performance bottlenecks.  ...  span several levels of the hardware/software stack to run efficiently [4].  ... 
arXiv:2110.00478v1 fatcat:qw7lh7tyzvflrpw56wamxcynki

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers [article]

Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan
2021 arXiv   pre-print
As a result, unlike other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers.  ...  This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations.  ...  We also detail how these units may be integrated into an existing DNN inference accelerator.  ... 
arXiv:2103.09301v1 fatcat:mnrcs6wjefaw5pmndpv3ucb6nm
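For context on why softmax is costly on accelerators: the baseline computation is a multi-pass max-subtract, exponentiate, and normalize sequence. A sketch of that baseline (Softermax itself replaces it with cheaper base-2 and online-normalization variants; this is only the reference computation the paper sets out to optimize):

```python
import math

def softmax(logits):
    # Pass 1: find the max for numerical stability.
    # Pass 2: exponentiate shifted logits.
    # Pass 3: normalize. These dependent passes over the full row are
    # what make softmax latency-heavy relative to matrix multiplies.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```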

Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration [article]

Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt (+7 others)
2021 arXiv   pre-print
To address this challenge, we present Gemmini, an open-source, full-stack DNN accelerator generator.  ...  Gemmini generates a wide design space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level  ...  Here, we discuss the hardware, software, and system-level requirements for DNN accelerator generators to enable full-stack, systematic DNN architecture evaluation.  ... 
arXiv:1911.09925v3 fatcat:yftbmax3c5dqtfvovhyz57oihy

Special Session: Towards an Agile Design Methodology for Efficient, Reliable, and Secure ML Systems [article]

Shail Dave, Alberto Marchisio, Muhammad Abdullah Hanif, Amira Guesmi, Aviral Shrivastava, Ihsen Alouani, Muhammad Shafique
2022 arXiv   pre-print
Privacy concerns are also becoming a first-order issue.  ...  Apart from high efficiency requirements, modern ML systems are expected to be highly reliable against hardware failures as well as secure against adversarial and IP stealing attacks.  ...  Then, we present an agile design methodology for obtaining efficient hardware/software co-designs for NPUs, along with automating full-stack development for a broad set of NPU architectures.  ... 
arXiv:2204.09514v1 fatcat:ho7auszvmferrn36evs7oqdpt4

SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads [article]

Sam Likun Xi, Yuan Yao, Kshitij Bhardwaj, Paul Whatmough, Gu-Yeon Wei, David Brooks
2019 arXiv   pre-print
SMAUG offers researchers a wide range of capabilities for evaluating DNN workloads, from diverse network topologies to easy accelerator modeling and SoC integration.  ...  We find that the accelerator may account for only 25-40% of overall single-batch inference latency, with the rest spent on data movement and in the deep learning software framework.  ...  SMAUG is designed to enable DNN researchers to rapidly evaluate different accelerator and SoC designs and perform hardware-software co-design, not to replace existing frameworks.  ... 
arXiv:1912.04481v2 fatcat:akewc2b7xvbm7malvjxcx6xj2i

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning [article]

Youngeun Kwon, Yunjae Lee, Minsoo Rhu
2019 arXiv   pre-print
We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-data processing cores tailored for DL tensor operations.  ...  These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, allowing GPUs to utilize it for scalable memory bandwidth and capacity expansion.  ...  Second, our proposal covers multiple levels in the hardware/software stack, so a cycle-level hardware performance model of TensorDIMM and TensorNode alone will not properly reflect the complex interaction  ... 
arXiv:1908.03072v2 fatcat:yiwl72jovnhkniwtn6cg3owdfy
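The tensor operation TensorDIMM pushes near memory is, at its core, an embedding gather-and-reduce: fetch a set of table rows and sum them. A pure-Python sketch of that bandwidth-bound primitive (the function name is illustrative):

```python
def embedding_gather_reduce(table, indices):
    """Gather rows of an embedding table and sum-reduce them.
    Every row touched costs a full DRAM burst, which is why moving
    this reduction next to the DIMMs pays off."""
    dim = len(table[0])
    out = [0.0] * dim
    for i in indices:
        row = table[i]
        for d in range(dim):
            out[d] += row[d]
    return out
```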

LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference [article]

Yujeong Choi, Yunseong Kim, Minsoo Rhu
2020 arXiv   pre-print
Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be executed in parallel.  ...  In cloud ML inference systems, batching is an essential technique to increase throughput, which helps optimize total cost of ownership.  ...  prototype implementation that LazyBatching can readily be implemented on top of the existing hardware/software stack.  ... 
arXiv:2010.13103v1 fatcat:4kvl5wcvxvfg3a6pvx3ulun2ki
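The scheduling trade-off behind SLA-aware batching can be sketched in a few lines: grow the batch for throughput, but never hold the oldest request past its deadline. This is a deliberately simplified, hypothetical sketch, not LazyBatching's actual node-level scheduler:

```python
def form_batch(queue, max_batch, deadline_s, now):
    # Pop requests into one batch, stopping either at max_batch or once
    # the oldest request is too close to missing its SLA deadline.
    batch = [queue.pop(0)]
    oldest = batch[0]["arrival"]
    while queue and len(batch) < max_batch and now - oldest < deadline_s:
        batch.append(queue.pop(0))
    return batch
```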

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training [article]

Youngeun Kwon, Yunjae Lee, Minsoo Rhu
2020 arXiv   pre-print
We then propose our algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-scatter that encompasses all the key primitives  ...  As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works.  ...  Because of their highly deterministic dataflow, these dense DNN algorithms are amenable to accelerated computation using custom-designed architectures for both training and inference [3], [9], [11]
arXiv:2010.13100v1 fatcat:kt7vrmg7ezhijgdsvoqjywwkye
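The training-time counterpart of the embedding gather is a scatter-add: accumulating gradient rows back into the sparse table. A minimal sketch of that primitive, one of the gather-scatter operations the Tensor Casting abstraction covers (function name is illustrative):

```python
def scatter_add(table, indices, grads):
    """Scatter-add gradient rows into an embedding table in place.
    Repeated indices must accumulate, which is what makes this harder
    to parallelize than the forward-pass gather."""
    for i, g in zip(indices, grads):
        row = table[i]
        for d in range(len(g)):
            row[d] += g[d]
    return table
```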

You Only Search Once: A Fast Automation Framework for Single-Stage DNN/Accelerator Co-design [article]

Weiwei Chen
2020 arXiv   pre-print
DNN/Accelerator co-design has shown great potential in improving QoR and performance.  ...  Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator considering the DNN-specific characteristics.  ...  The proposed approaches allow us to identify approximately 10^6 highly relevant hardware/software implementations from 10^15 possible solutions in a few hours, so that the problem of complete algorithm-hardware  ... 
arXiv:2005.07075v1 fatcat:vrtwedtzdzbnvdjz2frpjhwkbe

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations [article]

Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu
2020 arXiv   pre-print
Sparse embedding layers are a crucial building block in designing recommendation models, yet little attention has been paid to properly accelerating this important ML algorithm.  ...  We then present Centaur, a chiplet-based hybrid sparse-dense accelerator that addresses both the memory throughput challenges of embedding layers and the compute limitations of MLP layers.  ...  We thank Jaewoong Sim and Intel Labs for giving us access to the CPU+FPGA system through the Hardware Accelerator Research Program (HARP). We also thank the anonymous  ... 
arXiv:2005.05968v1 fatcat:ko6c4blrsnez7awjmm3ptwgg5a

TEA-DNN: the Quest for Time-Energy-Accuracy Co-optimized Deep Neural Networks [article]

Lile Cai, Anne-Maelle Barneche, Arthur Herbout, Chuan Sheng Foo, Jie Lin, Vijay Ramaseshan Chandrasekhar, Mohamed M. Sabry
2019 arXiv   pre-print
Second, there has been increasing interest in developing hardware accelerators for CNNs that provide improved inference performance and energy consumption compared to GPUs.  ...  We apply TEA-DNN for image classification on actual embedded platforms (NVIDIA Jetson TX2 and Intel Movidius Neural Compute Stick).  ...  ACKNOWLEDGEMENTS This research is supported by A*STAR under its Hardware-Software Co-optimisation for Deep Learning (Project No. A1892b0026).  ... 
arXiv:1811.12065v2 fatcat:ojne56etxreqza4vpsk5oza3ee

CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices

Caiwen Ding, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, Bo Yuan, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu (+4 others)
2017 Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-50 '17  
The CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture.  ...  For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency.  ...  Based on block-circulant matrix-based algorithms, we propose the CirCNN architecture, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network  ... 
doi:10.1145/3123939.3124552 dblp:conf/micro/DingLWLLZWQBYMZ17 fatcat:yghqzgu65feuzjujvhvx2penie
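The compression behind CirCNN is that a circulant block is fully determined by one row, cutting storage per k-by-k block from k^2 to k values. A direct O(k^2) sketch of the block matrix-vector product (CirCNN itself uses FFTs to bring the compute down to O(k log k)):

```python
def circulant_matvec(first_row, x):
    # Row i of a circulant matrix is first_row rotated right by i,
    # so C[i][j] = first_row[(j - i) mod k]. Only first_row is stored.
    k = len(first_row)
    return [sum(first_row[(j - i) % k] * x[j] for j in range(k))
            for i in range(k)]
```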

DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs [article]

Alessio Burrello, Angelo Garofalo, Nazareno Bruschi, Giuseppe Tagliavini, Davide Rossi, Francesco Conti
2021 arXiv   pre-print
As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low-power MCU-class devices on the market.  ...  Using our tool, GAP8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average at 4.3 fps, 15.4x better than an STM32-F746.  ...  ACKNOWLEDGEMENT The authors thank Daniele and Margot Palossi for their help in setting up the RocketLogger to obtain GAP8 power traces.  ... 
arXiv:2008.07127v2 fatcat:ne45ygirkjfylbvolg7zvfitqi
Showing results 1 — 15 out of 154 results