154 Hits in 5.6 sec

LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference [article]

Erwei Wang, James J. Davis, Peter Y. K. Cheung, George A. Constantinides
2020 arXiv   pre-print
Inspired by this observation, we propose LUTNet, an end-to-end hardware-software framework for the construction of area-efficient FPGA-based neural network accelerators using the native LUTs as inference  ...  For both varieties, we demonstrate that the exploitation of LUT flexibility allows for far heavier pruning than possible in prior works, resulting in significant area savings while achieving comparable  ...  During inference, the most common (and expensive) computational node in a DNN performs a function of the form in (1), calculating a channel output y.  ... 
arXiv:1910.12625v2 fatcat:hyrkn7hmuzfh5g4xpivwg46oje
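The LUTNet snippet refers to the canonical node computation of its Eq. (1): a channel output produced by a weighted sum followed by an activation. A minimal pure-Python sketch of that node (the function and parameter names are illustrative, not taken from the paper):

```python
import math

def channel_output(weights, inputs, activation=math.tanh):
    """Compute y = phi(sum_n w_n * x_n): the multiply-accumulate-then-
    activate node that dominates DNN inference cost, and the function
    LUTNet re-maps onto native FPGA LUTs instead of fixed multipliers."""
    acc = sum(w * x for w, x in zip(weights, inputs))
    return activation(acc)
```

Pruning zeroes out many of the w_n; the paper's claim is that LUT-based nodes tolerate far heavier pruning than fixed multiply-accumulate arrays.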

Learned Hardware/Software Co-Design of Neural Accelerators [article]

Zhan Shi, Chirag Sakhuja, Milad Hashemi, Kevin Swersky, Calvin Lin
2020 arXiv   pre-print
This paper instead casts the problem as hardware/software co-design, with the goal of automatically identifying desirable points in the joint design space.  ...  Because the design space of deep learning software stacks and hardware accelerators is diverse and vast, prior work considers software optimizations separately from hardware architectures, effectively  ...  a nested approach for co-optimizing hardware/software parameters.  ... 
arXiv:2010.02075v1 fatcat:fbri7ktvinhdraq6krswlealdi

SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference [article]

Jude Haris, Perry Gibson, José Cano, Nicolas Bohm Agostini, David Kaeli
2021 arXiv   pre-print
In this paper we propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized DNN inference accelerators on edge devices with FPGAs.  ...  We quickly and iteratively explore the system's hardware/software stack, while identifying and mitigating performance bottlenecks.  ...  span several levels of the hardware/software stack to run efficiently [4].  ... 
arXiv:2110.00478v1 fatcat:qw7lh7tyzvflrpw56wamxcynki

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers [article]

Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan
2021 arXiv   pre-print
As a result, unlike other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers.  ...  This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations.  ...  We also detail how these units may be integrated into an existing DNN inference accelerator.  ... 
arXiv:2103.09301v1 fatcat:mnrcs6wjefaw5pmndpv3ucb6nm
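For context on why softmax is costly on accelerators: the baseline computation is a multi-pass max-subtract, exponentiate, and normalize sequence. A sketch of that baseline (Softermax itself replaces it with cheaper base-2 and online-normalization variants; this is only the reference computation the paper sets out to optimize):

```python
import math

def softmax(logits):
    # Pass 1: find the max for numerical stability.
    # Pass 2: exponentiate shifted logits.
    # Pass 3: normalize. These dependent passes over the full row are
    # what make softmax latency-heavy relative to matrix multiplies.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```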

Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration [article]

Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt (+7 others)
2021 arXiv   pre-print
To address this challenge, we present Gemmini, an open-source, full-stack DNN accelerator generator.  ...  Gemmini generates a wide design space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level  ...  Here, we discuss the hardware, software, and system-level requirements for DNN accelerator generators to enable full-stack, systematic DNN architecture evaluation.  ... 
arXiv:1911.09925v3 fatcat:yftbmax3c5dqtfvovhyz57oihy

Special Session: Towards an Agile Design Methodology for Efficient, Reliable, and Secure ML Systems [article]

Shail Dave, Alberto Marchisio, Muhammad Abdullah Hanif, Amira Guesmi, Aviral Shrivastava, Ihsen Alouani, Muhammad Shafique
2022 arXiv   pre-print
Privacy concerns are also becoming a first-order issue.  ...  Apart from high efficiency requirements, modern ML systems are expected to be highly reliable against hardware failures as well as secure against adversarial and IP stealing attacks.  ...  Then, we present an agile design methodology for obtaining efficient hardware/software co-designs for NPUs, along with automating full-stack development for a broad set of NPU architectures.  ... 
arXiv:2204.09514v1 fatcat:ho7auszvmferrn36evs7oqdpt4

SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads [article]

Sam Likun Xi, Yuan Yao, Kshitij Bhardwaj, Paul Whatmough, Gu-Yeon Wei, David Brooks
2019 arXiv   pre-print
SMAUG offers researchers a wide range of capabilities for evaluating DNN workloads, from diverse network topologies to easy accelerator modeling and SoC integration.  ...  We find that the accelerator may account for only 25-40% of overall single-batch inference latency, with the rest spent on data movement and in the deep learning software framework.  ...  SMAUG is designed to enable DNN researchers to rapidly evaluate different accelerator and SoC designs and perform hardware-software co-design, not to replace existing frameworks.  ... 
arXiv:1912.04481v2 fatcat:akewc2b7xvbm7malvjxcx6xj2i

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning [article]

Youngeun Kwon, Yunjae Lee, Minsoo Rhu
2019 arXiv   pre-print
We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-data processing cores tailored for DL tensor operations.  ...  These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, allowing GPUs to utilize it for scalable memory bandwidth and capacity expansion.  ...  Second, our proposal covers multiple levels in the hardware/software stack, so a cycle-level hardware performance model of TensorDIMM and TensorNode alone will not properly reflect the complex interaction  ... 
arXiv:1908.03072v2 fatcat:yiwl72jovnhkniwtn6cg3owdfy
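The tensor operation TensorDIMM pushes near memory is, at its core, an embedding gather-and-reduce: fetch a set of table rows and sum them. A pure-Python sketch of that bandwidth-bound primitive (the function name is illustrative):

```python
def embedding_gather_reduce(table, indices):
    """Gather rows of an embedding table and sum-reduce them.
    Every row touched costs a full DRAM burst, which is why moving
    this reduction next to the DIMMs pays off."""
    dim = len(table[0])
    out = [0.0] * dim
    for i in indices:
        row = table[i]
        for d in range(dim):
            out[d] += row[d]
    return out
```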

LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference [article]

Yujeong Choi, Yunseong Kim, Minsoo Rhu
2020 arXiv   pre-print
Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be executed in parallel.  ...  In cloud ML inference systems, batching is an essential technique to increase throughput, which helps optimize total cost of ownership.  ...  prototype implementation that LazyBatching can readily be implemented on top of the existing hardware/software stack.  ... 
arXiv:2010.13103v1 fatcat:4kvl5wcvxvfg3a6pvx3ulun2ki
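The scheduling trade-off behind SLA-aware batching can be sketched in a few lines: grow the batch for throughput, but never hold the oldest request past its deadline. This is a deliberately simplified, hypothetical sketch, not LazyBatching's actual node-level scheduler:

```python
def form_batch(queue, max_batch, deadline_s, now):
    # Pop requests into one batch, stopping either at max_batch or once
    # the oldest request is too close to missing its SLA deadline.
    batch = [queue.pop(0)]
    oldest = batch[0]["arrival"]
    while queue and len(batch) < max_batch and now - oldest < deadline_s:
        batch.append(queue.pop(0))
    return batch
```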

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training [article]

Youngeun Kwon, Yunjae Lee, Minsoo Rhu
2020 arXiv   pre-print
We then propose our algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-scatter that encompasses all the key primitives  ...  As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works.  ...  Because of their highly deterministic dataflow, these dense DNN algorithms are amenable to accelerated computation using custom-designed architectures for both training and inference [3], [9], [11]
arXiv:2010.13100v1 fatcat:kt7vrmg7ezhijgdsvoqjywwkye
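The training-time counterpart of the embedding gather is a scatter-add: accumulating gradient rows back into the sparse table. A minimal sketch of that primitive, one of the gather-scatter operations the Tensor Casting abstraction covers (function name is illustrative):

```python
def scatter_add(table, indices, grads):
    """Scatter-add gradient rows into an embedding table in place.
    Repeated indices must accumulate, which is what makes this harder
    to parallelize than the forward-pass gather."""
    for i, g in zip(indices, grads):
        row = table[i]
        for d in range(len(g)):
            row[d] += g[d]
    return table
```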

You Only Search Once: A Fast Automation Framework for Single-Stage DNN/Accelerator Co-design [article]

Weiwei Chen
2020 arXiv   pre-print
DNN/Accelerator co-design has shown great potential in improving QoR and performance.  ...  Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator considering the DNN-specific characteristics.  ...  The proposed approaches allow us to identify approximately 10^6 highly relevant hardware/software implementations from 10^15 possible solutions in a few hours, so that the problem of complete algorithm-hardware  ... 
arXiv:2005.07075v1 fatcat:vrtwedtzdzbnvdjz2frpjhwkbe

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations [article]

Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu
2020 arXiv   pre-print
Sparse embedding layers are a crucial building block in designing recommendation models, yet little attention has been paid to properly accelerating this important ML algorithm.  ...  We then present Centaur, a chiplet-based hybrid sparse-dense accelerator that addresses both the memory throughput challenges of embedding layers and the compute limitations of MLP layers.  ...  We thank Jaewoong Sim and Intel Labs for giving us access to the CPU+FPGA system through the Hardware Accelerator Research Program (HARP). We also thank the anonymous  ... 
arXiv:2005.05968v1 fatcat:ko6c4blrsnez7awjmm3ptwgg5a

TEA-DNN: the Quest for Time-Energy-Accuracy Co-optimized Deep Neural Networks [article]

Lile Cai, Anne-Maelle Barneche, Arthur Herbout, Chuan Sheng Foo, Jie Lin, Vijay Ramaseshan Chandrasekhar, Mohamed M. Sabry
2019 arXiv   pre-print
Second, there has been increasing interest in developing hardware accelerators for CNNs that provide improved inference performance and energy consumption compared to GPUs.  ...  We apply TEA-DNN for image classification on actual embedded platforms (NVIDIA Jetson TX2 and Intel Movidius Neural Compute Stick).  ...  ACKNOWLEDGEMENTS This research is supported by A*STAR under its Hardware-Software Co-optimisation for Deep Learning (Project No. A1892b0026).  ... 
arXiv:1811.12065v2 fatcat:ojne56etxreqza4vpsk5oza3ee

CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices

Caiwen Ding, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, Bo Yuan, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu (+4 others)
2017 Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-50 '17  
The CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture.  ...  For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency.  ...  Based on block-circulant matrix-based algorithms, we propose the CirCNN architecture, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network  ... 
doi:10.1145/3123939.3124552 dblp:conf/micro/DingLWLLZWQBYMZ17 fatcat:yghqzgu65feuzjujvhvx2penie
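The compression behind CirCNN is that a circulant block is fully determined by one row, cutting storage per k-by-k block from k^2 to k values. A direct O(k^2) sketch of the block matrix-vector product (CirCNN itself uses FFTs to bring the compute down to O(k log k)):

```python
def circulant_matvec(first_row, x):
    # Row i of a circulant matrix is first_row rotated right by i,
    # so C[i][j] = first_row[(j - i) mod k]. Only first_row is stored.
    k = len(first_row)
    return [sum(first_row[(j - i) % k] * x[j] for j in range(k))
            for i in range(k)]
```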

DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs [article]

Alessio Burrello, Angelo Garofalo, Nazareno Bruschi, Giuseppe Tagliavini, Davide Rossi, Francesco Conti
2021 arXiv   pre-print
As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low-power MCU-class devices on the market.  ...  Using our tool, GAP8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average at 4.3 fps, 15.4x better than an STM32-F746.  ...  ACKNOWLEDGEMENT The authors thank Daniele and Margot Palossi for their help in setting up the RocketLogger to obtain GAP8 power traces.  ... 
arXiv:2008.07127v2 fatcat:ne45ygirkjfylbvolg7zvfitqi
Showing results 1 — 15 out of 154 results