Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors
[article]
2021
arXiv
pre-print
For this reason, optimized hardware accelerators are used to increase the performance of neural network inference. ...
In particular, we focus on accelerating the inference of convolutional neural networks (CNNs) used for image recognition tasks. Given that many different hardware architectures exist ...
However, cutting off all negative neuron outputs leads to a loss of information. For this reason, leaky ReLU is used to also take negative neuron outputs into account [4]. ...
arXiv:2104.09252v1
fatcat:625wtuskhff3lbswhwmj7decni
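The tutorial snippet above contrasts ReLU, which discards negative neuron outputs, with leaky ReLU, which keeps them at a reduced scale. A minimal numpy sketch of the two activations (the slope value is an illustrative default, not taken from the tutorial):

    import numpy as np

    def relu(x):
        # Standard ReLU: all negative outputs are cut off (information loss).
        return np.maximum(0.0, x)

    def leaky_relu(x, negative_slope=0.01):
        # Leaky ReLU: negative outputs are retained, scaled by a small slope.
        return np.where(x > 0, x, negative_slope * x)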
Randomness In Neural Network Training: Characterizing The Impact of Tooling
[article]
2021
arXiv
pre-print
However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to 746%, 241%, and 196% on a spectrum of ...
widely used GPU accelerator architectures, relative to non-deterministic training. ...
The high level of IMPL despite the systolic design appears to be due to the reliance of Tensor Cores on non-deterministic CUDA cores on the GPU for computations that are not supported. ...
arXiv:2106.11872v1
fatcat:7shulcpu3bb3tikwl3d3naweku
Parallel Algorithms for Constrained Tensor Factorization via Alternating Direction Method of Multipliers
2015
IEEE Transactions on Signal Processing
on regular high-performance computing (e.g., mesh) architectures. ...
With few recent exceptions, all tensor factorization algorithms were originally developed for centralized, in-memory computation on a single machine; and the few that break away from this mold do not easily ...
NON-NEGATIVE TENSOR FACTORIZATION: Let a third-order tensor admit a non-negative CP decomposition, with the three factor matrices constrained to be element-wise non-negative. ...
doi:10.1109/tsp.2015.2454476
fatcat:z2yhxvgnibd57ne2dkd2hwqhvi
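For reference, the rank-F non-negative CP model of a third-order tensor that the snippet above refers to takes the following textbook form (the snippet's exact symbols were lost in extraction, so the names below are the conventional ones, not necessarily the paper's):

\[
\underline{\mathbf{X}} \approx \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f,
\qquad
\mathbf{A} = [\mathbf{a}_1,\dots,\mathbf{a}_F] \in \mathbb{R}_{+}^{I \times F},\;
\mathbf{B} \in \mathbb{R}_{+}^{J \times F},\;
\mathbf{C} \in \mathbb{R}_{+}^{K \times F},
\]

where \(\circ\) denotes the vector outer product and every factor matrix is element-wise non-negative.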
INsight: A Neuromorphic Computing System for Evaluation of Large Neural Networks
[article]
2015
arXiv
pre-print
The computing system consists of a non-conventional compiler, a neuromorphic architecture, and a space-efficient microarchitecture that leverages existing integrated circuit design methodologies. ...
The compiler factorizes a trained, feedforward network into a sparsely connected network, compresses the weights linearly, and generates a time delay neural network, reducing the number of connections. ...
By applying a tensor decomposition method to the weight tensor, a layer can be factorized into five sublayers. ...
arXiv:1508.01008v1
fatcat:noi45s26vfbdnoqpn42cgxjqba
Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights
[article]
2021
arXiv
pre-print
their efficient computations; analyzing trade-offs in opting for a specific design choice for encoding, storing, extracting, communicating, computing, and load-balancing the non-zeros; understanding how ...
This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators. ...
Hardware accelerators can efficiently process tensor computations of ML models. In particular, coarse-grain spatial architectures are a common choice for hardware accelerator designs. ...
arXiv:2007.00864v2
fatcat:k4o2xboh4vbudadfiriiwjp7uu
Agile Autotuning of a Transprecision Tensor Accelerator Overlay for TVM Compiler Stack
[article]
2020
arXiv
pre-print
Specialized accelerators for tensor operations, such as blocked-matrix operations and multi-dimensional convolutions, have emerged as powerful architecture choices for high-performance Deep-Learning ...
Programmable tensor accelerators offer a promising alternative by allowing reconfiguration of a virtual architecture that overlays on top of the physical FPGA configurable fabric. ...
arXiv:2004.10854v1
fatcat:i4gkoxo2nbdf5l3wo5cln77lsy
Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks
[article]
2017
arXiv
pre-print
Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference. ...
Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. ...
Our discovery suggests a potential gain in efficiency and performance of future hardware architectures specialized in deep neural network training. ...
arXiv:1711.02213v2
fatcat:o6d2myc4ovgtfkknowwbzss3gq
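Flexpoint, as described in this entry's paper, stores a tensor as fixed-point mantissas that share a single exponent across the whole tensor. A minimal numpy sketch of that shared-exponent idea (the bit width and the exponent-selection rule here are simplifications, not the paper's exponent-management algorithm):

    import numpy as np

    def to_shared_exponent(x, mantissa_bits=16):
        # Pick one exponent for the whole tensor so the largest magnitude still
        # fits in a signed mantissa of the given width.
        max_int = 2 ** (mantissa_bits - 1) - 1
        exp = int(np.ceil(np.log2(np.max(np.abs(x)) / max_int + 1e-30)))
        mant = np.clip(np.round(x / 2.0 ** exp), -max_int, max_int).astype(np.int32)
        return mant, exp

    def from_shared_exponent(mant, exp):
        # Reconstruct an approximation of the original tensor.
        return mant.astype(np.float64) * 2.0 ** exp

Because the exponent is shared per tensor, the hardware only needs integer arithmetic on the mantissas, which is the efficiency argument the entry makes for training-oriented formats.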
Computation on Sparse Neural Networks: an Inspiration for Future Hardware
[article]
2020
arXiv
pre-print
We observe that the search for the sparse structure can be a general methodology for high-quality model explorations, in addition to a strategy for high-efficiency model execution. ...
Thus, finding better model architectures that require much less computation while maximally preserving accuracy is a popular research topic. ...
Thus, the same model architecture can be applied to both edges (when the scaling factor is small) and servers (when the scaling factor is large). ...
arXiv:2004.11946v1
fatcat:2lnbtmi4grb65nxcxab4kz6pvy
ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks
[article]
2017
arXiv
pre-print
In this paper we introduce ShiftCNN, a generalized low-precision architecture for inference of multiplierless convolutional neural networks (CNNs). ...
ShiftCNN is based on a power-of-two weight representation and, as a result, performs only shift and addition operations. ...
The proposed architecture combines several unique features. First, it employs a hardware-efficient power-of-two weight representation which requires performing only shift and addition operations. ...
arXiv:1706.02393v1
fatcat:s67d4y43zrap3du5p3m6zckkva
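The ShiftCNN snippet above describes weights restricted to powers of two so that multiplications reduce to shifts. A minimal numpy sketch of rounding weights to signed powers of two (the exponent range and the zero threshold are illustrative; ShiftCNN's actual multi-term codebook is not reproduced here):

    import numpy as np

    def quantize_power_of_two(w, min_exp=-7, max_exp=0):
        # Round each weight to sign(w) * 2^e, with e obtained by rounding log2|w|
        # and clipping to the allowed exponent range.
        sign = np.sign(w)
        mag = np.abs(w)
        e = np.clip(np.round(np.log2(np.maximum(mag, 1e-30))), min_exp, max_exp)
        q = sign * np.exp2(e)
        # Weights far below the smallest representable magnitude become zero.
        q[mag < 2.0 ** (min_exp - 1)] = 0.0
        return q

With power-of-two weights, each multiply w*x becomes a bit shift of x by e plus additions, which is the shift-and-add property the entry highlights.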
Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software: Extended Analysis
[article]
2020
arXiv
pre-print
SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate ...
if the parameter defaults in SparTen are appropriate for general tensor data. ...
Acknowledgment We would like to thank Richard Barrett for assistance utilizing computing resources at Sandia National Laboratories and Rich Lehoucq for comments of support. ...
arXiv:2012.01520v1
fatcat:uar4bs4iynd45ka2ygwbznsvha
Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs
2019
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2× and 10× using a Tesla V100 GPU. ...
We provide a detailed design strategy that takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. ...
While the Pascal GPU architecture introduced hardware support for FP16 arithmetic, the Volta architecture, which powers the Summit supercomputer, comes with hardware acceleration units (called Tensor ...
doi:10.1109/ipdps.2019.00022
dblp:conf/ipps/AbdelfattahTD19
fatcat:onwad4dv7faslikjgy7sf2wda4
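A minimal PyTorch sketch of the workload this entry targets, a batched multiplication of many small half-precision matrices on a GPU. This simply calls the library's batched GEMM (which uses Tensor Cores on Volta-class hardware); it is not the paper's hand-tuned kernel, and the batch and matrix sizes are illustrative:

    import torch

    if torch.cuda.is_available():
        # Batch of 10,000 small 32x32 matrices in half precision on the GPU.
        a = torch.randn(10_000, 32, 32, dtype=torch.float16, device="cuda")
        b = torch.randn(10_000, 32, 32, dtype=torch.float16, device="cuda")
        # One batched GEMM call over the whole batch.
        c = torch.bmm(a, b)
        print(c.shape)  # torch.Size([10000, 32, 32])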
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
[article]
2022
arXiv
pre-print
For hardware-deployable quantization, however, there remains a huge, unsettled accuracy gap. ...
This is because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployments. ...
TensorRT [22] is a high-performance inference library developed by NVIDIA. The quantization scheme in TensorRT is symmetric per-channel for weights, and symmetric per-tensor for activations. ...
arXiv:2111.03759v2
fatcat:k5woc6az6zfgpcgx5dxgzhkfou
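A numpy sketch of the symmetric scheme the snippet attributes to TensorRT: per-channel scales for weights and a single per-tensor scale for activations. This is a generic illustration, not TensorRT's API; the 8-bit range is an assumption based on common INT8 deployment:

    import numpy as np

    def symmetric_quantize(x, scale, num_bits=8):
        # Symmetric quantization: zero-point is 0 and values map to [-qmax, qmax].
        qmax = 2 ** (num_bits - 1) - 1
        return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

    def per_channel_weight_scales(w, num_bits=8):
        # One scale per output channel (axis 0) of the weight tensor.
        qmax = 2 ** (num_bits - 1) - 1
        return np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax

    def per_tensor_activation_scale(x, num_bits=8):
        # A single scale for the whole activation tensor.
        return np.abs(x).max() / (2 ** (num_bits - 1) - 1)

    # Example: conv weights (out_ch, in_ch, kh, kw) get per-channel scales,
    # activations get one scale for the whole tensor.
    w = np.random.randn(64, 32, 3, 3)
    a = np.random.randn(1, 32, 56, 56)
    w_q = symmetric_quantize(w, per_channel_weight_scales(w).reshape(-1, 1, 1, 1))
    a_q = symmetric_quantize(a, per_tensor_activation_scale(a))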
StereoSpike: Depth Learning with a Spiking Neural Network
[article]
2021
arXiv
pre-print
We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy. ...
Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, that ...
We would like to thank Amirreza Yousefzadeh for his help and expertise on digital neuromorphic hardware. ...
arXiv:2109.13751v2
fatcat:hcghwoxegzgeljc5oj7zb6sa5m
Task-Based Programming for Seismic Imaging: Preliminary Results
2014
2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)
The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms ...
The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. ...
doi:10.1109/hpcc.2014.205
dblp:conf/hpcc/BoillotBAC14
fatcat:qi23cnnewndmlj6idxnw7hpxta
InSight: An FPGA-Based Neuromorphic Computing System for Deep Neural Networks
2020
Journal of Low Power Electronics and Applications
The computing system consists of a non-conventional compiler, a neuromorphic hardware architecture, and a space-efficient microarchitecture that leverages existing integrated circuit design methodologies ...
This paper describes a neuromorphic computing system that is designed from the ground up for energy-efficient evaluation of deep neural networks. ...
For each convolutional layer, we perform tensor factorization. For each fully-connected layer, we perform matrix factorization (Step 1). Both are based on singular value decomposition (SVD). ...
doi:10.3390/jlpea10040036
fatcat:qgl2htptxrfdpoyuufsnqm6obq
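A numpy sketch of the SVD-based matrix factorization this entry applies to fully-connected layers: a weight matrix W is replaced by two thinner sublayers whose product approximates W. The rank choice and variable names are illustrative, not taken from the paper:

    import numpy as np

    def factorize_fc_layer(W, rank):
        # W: (out_features, in_features). Truncated SVD yields two smaller
        # matrices that replace the single layer with two sublayers.
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        W1 = S[:rank, None] * Vt[:rank, :]   # first sublayer:  (rank, in_features)
        W2 = U[:, :rank]                     # second sublayer: (out_features, rank)
        return W1, W2

    # y ≈ W2 @ (W1 @ x): parameter count drops from out*in to rank*(out+in),
    # which is the connection reduction the compiler exploits.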
Showing results 1 — 15 out of 6,721 results