6,721 Hits in 5.3 sec

Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors [article]

Lukas Baischer, Matthias Wess, Nima TaheriNejad
2021 arXiv   pre-print
For this reason, optimized hardware accelerators are used to increase the performance of neural network inference.  ...  In particular, we focus on acceleration of the inference of convolutional neural networks (CNNs) used for image recognition tasks. Given that there exist many different hardware architectures,  ...  However, cutting off all negative neuron outputs leads to a loss of information. For this reason, leaky ReLU is used in order to additionally consider negative neuron outputs [4].  ...
arXiv:2104.09252v1 fatcat:625wtuskhff3lbswhwmj7decni
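
The leaky ReLU mentioned in this snippet keeps a small slope for negative inputs instead of zeroing them out, so information in negative neuron outputs is not discarded. A minimal sketch (the slope value 0.01 is an illustrative choice, not taken from the tutorial):

    import numpy as np

    def leaky_relu(x, negative_slope=0.01):
        # Pass positive values through unchanged; scale negative values by a
        # small slope so they still carry information downstream.
        return np.where(x >= 0, x, negative_slope * x)

    print(leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [-0.02 -0.005 0. 1.5]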

Randomness In Neural Network Training: Characterizing The Impact of Tooling [article]

Donglin Zhuang, Xingyao Zhang, Shuaiwen Leon Song, Sara Hooker
2021 arXiv   pre-print
However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to 746%, 241%, and 196% on a spectrum of  ...  widely used GPU accelerator architectures, relative to non-deterministic training.  ...  The high level of IMPL despite the systolic design appears to be due to the reliance of Tensor Cores on non-deterministic CUDA cores on GPU for computations that are not supported.  ...
arXiv:2106.11872v1 fatcat:7shulcpu3bb3tikwl3d3naweku
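
The determinism overhead discussed here comes from opting into deterministic tooling. One common recipe in PyTorch looks like the sketch below; it is framework- and version-dependent and is shown only to illustrate the kind of switches whose cost the paper measures, not the paper's own setup:

    import random
    import numpy as np
    import torch

    def make_deterministic(seed=0):
        # Seed every RNG the training loop may touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        # Force deterministic kernel selection; this is where the slowdown comes from.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)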

Parallel Algorithms for Constrained Tensor Factorization via Alternating Direction Method of Multipliers

Athanasios P. Liavas, Nicholas D. Sidiropoulos
2015 IEEE Transactions on Signal Processing  
on regular high-performance computing (e.g., mesh) architectures.  ...  With few recent exceptions, all tensor factorization algorithms were originally developed for centralized, in-memory computation on a single machine; and the few that break away from this mold do not easily  ...  NON-NEGATIVE TENSOR FACTORIZATION. Let the tensor admit a non-negative CP decomposition of a given order, with three non-negative factor matrices.  ...
doi:10.1109/tsp.2015.2454476 fatcat:z2yhxvgnibd57ne2dkd2hwqhvi
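
For readers unfamiliar with the model, a non-negative CP decomposition of a 3-way tensor can be computed with the TensorLy reference implementation as in the sketch below; this is not the parallel ADMM algorithm of the paper, and the rank of 4 is an arbitrary choice for illustration:

    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import non_negative_parafac

    X = tl.tensor(np.random.rand(10, 8, 6))          # toy 3-way data tensor
    cp = non_negative_parafac(X, rank=4)             # constrained CP decomposition
    weights, factors = cp                            # three non-negative factor matrices
    X_hat = tl.cp_to_tensor(cp)                      # reconstruct and check the fit
    print(tl.norm(X - X_hat) / tl.norm(X))           # relative approximation error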

INsight: A Neuromorphic Computing System for Evaluation of Large Neural Networks [article]

Jaeyong Chung, Taehwan Shin, Yongshin Kang
2015 arXiv   pre-print
The computing system consists of a non-conventional compiler, a neuromorphic architecture, and a space-efficient microarchitecture that leverages existing integrated circuit design methodologies.  ...  The compiler factorizes a trained, feedforward network into a sparsely connected network, compresses the weights linearly, and generates a time-delay neural network, reducing the number of connections.  ...  By applying a tensor decomposition method to the weight tensor, a layer can be factorized into 5 sublayers.  ...
arXiv:1508.01008v1 fatcat:noi45s26vfbdnoqpn42cgxjqba
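
The factorization of a layer into sublayers can be illustrated with a CP decomposition of a 4-way convolution weight tensor; each resulting factor matrix then corresponds to one thin sublayer in a sequential replacement of the original layer. The shapes and rank below are illustrative and not taken from the paper:

    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import parafac

    W = tl.tensor(np.random.randn(64, 32, 3, 3))   # (out_ch, in_ch, kh, kw)
    weights, factors = parafac(W, rank=8)
    F_out, F_in, F_kh, F_kw = factors
    print(F_out.shape, F_in.shape, F_kh.shape, F_kw.shape)
    # (64, 8) (32, 8) (3, 8) (3, 8): four thin factors replace one dense kernel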

Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights [article]

Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, Baoxin Li
2021 arXiv   pre-print
their efficient computations; analyzing trade-offs in opting for a specific design choice for encoding, storing, extracting, communicating, computing, and load-balancing the non-zeros; understanding how  ...  This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of ML models on hardware accelerators.  ...  Hardware accelerators can efficiently process tensor computations of ML models. In particular, coarse-grain spatial architectures are a common choice for hardware accelerator designs.  ... 
arXiv:2007.00864v2 fatcat:k4o2xboh4vbudadfiriiwjp7uu
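
One of the non-zero encodings such a survey analyzes is compressed sparse row (CSR): only the non-zero values are stored, plus index metadata, trading extra index-extraction work for reduced storage and data movement. A small sketch using SciPy, purely for illustration:

    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([[0, 2, 0, 0],
                      [3, 0, 0, 4],
                      [0, 0, 0, 0]])
    A = csr_matrix(dense)
    print(A.data)     # [2 3 4]     non-zero values
    print(A.indices)  # [1 0 3]     column index of each non-zero
    print(A.indptr)   # [0 1 3 3]   row start offsets into data/indices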

Agile Autotuning of a Transprecision Tensor Accelerator Overlay for TVM Compiler Stack [article]

Dionysios Diamantopoulos, Burkhard Ringlein, Mitra Purandare, Gagandeep Singh, Christoph Hagleitner
2020 arXiv   pre-print
Specialized accelerators for tensor operations, such as blocked-matrix operations and multi-dimensional convolutions, have emerged as powerful architecture choices for high-performance Deep-Learning  ...  Programmable tensor accelerators offer a promising alternative by allowing reconfiguration of a virtual architecture that overlays on top of the physical FPGA configurable fabric.  ...
arXiv:2004.10854v1 fatcat:i4gkoxo2nbdf5l3wo5cln77lsy
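
Blocked-matrix operation, the kind of tensor kernel such an overlay specializes in, simply partitions the operands into tiles so each tile fits the accelerator's compute units. A minimal sketch in plain NumPy; the block size of 4 is arbitrary, whereas a hardware overlay would fix it to match its tiles:

    import numpy as np

    def blocked_matmul(A, B, bs=4):
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m))
        for i in range(0, n, bs):
            for j in range(0, m, bs):
                for p in range(0, k, bs):
                    # Accumulate one tile of C from matching tiles of A and B.
                    C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(blocked_matmul(A, B), A @ B)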

Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks [article]

Urs Köster, Tristan J. Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William H. Constable, Oğuz H. Elibol, Scott Gray, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J. Pai (+1 others)
2017 arXiv   pre-print
Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.  ...  Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning.  ...  Our discovery suggests a potential gain in efficiency and performance of future hardware architectures specialized in deep neural network training.  ... 
arXiv:1711.02213v2 fatcat:o6d2myc4ovgtfkknowwbzss3gq
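
The core idea behind block formats of this kind is a shared exponent: all values in a tensor share one exponent derived from the largest magnitude, and only small integer mantissas are stored per element. The sketch below is a simplified illustration of that idea under those assumptions, not the Flexpoint specification itself:

    import numpy as np

    def to_shared_exponent(x, mantissa_bits=16):
        # Choose one exponent large enough to cover the biggest magnitude.
        exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-30)))
        scale = 2.0 ** (exp - (mantissa_bits - 1))
        mant = np.round(x / scale).astype(np.int32)   # stored integer mantissas
        return mant, exp, scale

    x = np.random.randn(5).astype(np.float32)
    mant, exp, scale = to_shared_exponent(x)
    print(x, mant * scale)   # original vs. dequantized values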

Computation on Sparse Neural Networks: an Inspiration for Future Hardware [article]

Fei Sun, Minghai Qin, Tianyun Zhang, Liu Liu, Yen-Kuang Chen, Yuan Xie
2020 arXiv   pre-print
We observe that the search for the sparse structure can be a general methodology for high-quality model explorations, in addition to a strategy for high-efficiency model execution.  ...  Thus, finding better model architectures that require much less computation while maximally preserving accuracy is a popular research topic.  ...  Thus, the same model architecture can be applied to both edges (when the scaling factor is small) and servers (when the scaling factor is large).  ...
arXiv:2004.11946v1 fatcat:2lnbtmi4grb65nxcxab4kz6pvy
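
The scaling-factor idea in the last sentence can be pictured as a width multiplier: the same architecture definition is reused and the factor shrinks or grows every layer's channel count for edge versus server deployment. The names, channel counts, and rounding rule below are illustrative only:

    base_channels = [32, 64, 128, 256]

    def scale_channels(channels, factor):
        # Round to a multiple of 8, a common hardware-friendly choice.
        return [max(8, int(round(c * factor / 8)) * 8) for c in channels]

    print(scale_channels(base_channels, 0.25))  # small model for edge devices
    print(scale_channels(base_channels, 2.0))   # large model for servers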

ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks [article]

Denis A. Gudovskiy, Luca Rigazio
2017 arXiv   pre-print
In this paper we introduce ShiftCNN, a generalized low-precision architecture for inference of multiplierless convolutional neural networks (CNNs).  ...  ShiftCNN is based on a power-of-two weight representation and, as a result, performs only shift and addition operations.  ...  The proposed architecture combines several unique features. First, it employs a hardware-efficient power-of-two weight representation which requires performing only shift and addition operations.  ... 
arXiv:1706.02393v1 fatcat:s67d4y43zrap3du5p3m6zckkva
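
Rounding each weight to the nearest power of two is what lets every multiplication be replaced by a bit shift plus sign handling. The sketch below illustrates the general idea only, not ShiftCNN's exact codebook or bit allocation:

    import numpy as np

    def round_to_pow2(w):
        sign = np.sign(w)
        mag = np.abs(w)
        # Round the exponent, keeping zeros at zero.
        exponent = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
        q = sign * 2.0 ** exponent
        return np.where(mag > 0, q, 0.0)

    w = np.array([0.3, -0.12, 0.9, 0.0])
    print(round_to_pow2(w))   # [ 0.25 -0.125  1.    0.  ]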

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software: Extended Analysis [article]

Jeremy M. Myers, Daniel M. Dunlavy, Keita Teranishi, D. S. Hollman
2020 arXiv   pre-print
SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate  ...  if the parameter defaults in SparTen are appropriate for general tensor data.  ...  Acknowledgment We would like to thank Richard Barrett for assistance utilizing computing resources at Sandia National Laboratories and Rich Lehoucq for comments of support.  ... 
arXiv:2012.01520v1 fatcat:uar4bs4iynd45ka2ygwbznsvha

Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs

Ahmad Abdelfattah, Stanimire Tomov, Jack Dongarra
2019 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)  
Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2× and 10× using a Tesla V100 GPU.  ...  We provide a detailed design strategy that takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs.  ...  While the Pascal GPU architecture introduced hardware support for FP16 arithmetic, the Volta architecture, which powers the Summit supercomputer, comes with hardware acceleration units (called Tensor  ...
doi:10.1109/ipdps.2019.00022 dblp:conf/ipps/AbdelfattahTD19 fatcat:onwad4dv7faslikjgy7sf2wda4
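
A batched half-precision matrix multiply of many small matrices is the workload this paper targets; on Volta-class hardware and newer it maps onto Tensor Cores. The vendor-tuned path the authors compare against is cuBLAS batched GEMM; in the sketch below torch.bmm merely stands in as an illustration, and the batch and matrix sizes are arbitrary:

    import torch

    if torch.cuda.is_available():
        A = torch.randn(1000, 32, 32, dtype=torch.float16, device="cuda")
        B = torch.randn(1000, 32, 32, dtype=torch.float16, device="cuda")
        C = torch.bmm(A, B)          # 1000 independent 32x32 products in FP16
        print(C.shape, C.dtype)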

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark [article]

Yuhang Li, Mingzhu Shen, Jian Ma, Yan Ren, Mingxin Zhao, Qi Zhang, Ruihao Gong, Fengwei Yu, Junjie Yan
2022 arXiv   pre-print
For hardware-deployable quantization, however, there is a huge accuracy gap that remains unsettled.  ...  This is because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployment.  ...  TensorRT [22] is a high-performance inference library developed by NVIDIA. The quantization scheme in TensorRT is symmetric per-channel for weights and symmetric per-tensor for activations.  ...
arXiv:2111.03759v2 fatcat:k5woc6az6zfgpcgx5dxgzhkfou
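
The two symmetric quantization granularities mentioned for TensorRT differ only in how the scale is computed: one scale per output channel for weights versus a single scale for a whole activation tensor. A small NumPy sketch of that distinction; the int8 range and shapes are illustrative, not MQBench's or TensorRT's code:

    import numpy as np

    def symmetric_scale(x, n_bits=8):
        # Symmetric scheme: zero-point is 0, scale covers the max magnitude.
        return np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)

    W = np.random.randn(4, 16)                                  # (out_channels, in_features)
    per_channel = np.array([symmetric_scale(w) for w in W])     # one scale per output channel
    act = np.random.randn(32, 16)
    per_tensor = symmetric_scale(act)                           # single scale for the whole tensor

    W_q = np.round(W / per_channel[:, None]).astype(np.int8)    # quantized weights
    print(per_channel.shape, per_tensor, W_q.dtype)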

StereoSpike: Depth Learning with a Spiking Neural Network [article]

Ulysse Rançon, Javier Cuadrado-Anibarro, Benoit R. Cottereau, Timothée Masquelier
2021 arXiv   pre-print
We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy.  ...  Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, that  ...  We would like to thank Amirreza Yousefzadeh for his help and expertise on digital neuromorphic hardware.  ... 
arXiv:2109.13751v2 fatcat:hcghwoxegzgeljc5oj7zb6sa5m
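
The building block of the spiking network referred to above is typically a leaky integrate-and-fire (LIF) neuron, which integrates binary input events over time and emits a spike when its membrane potential crosses a threshold. The constants below are illustrative and not StereoSpike's:

    import numpy as np

    def lif(inputs, beta=0.9, threshold=1.0):
        v, spikes = 0.0, []
        for x in inputs:
            v = beta * v + x              # leaky integration of input events
            s = 1.0 if v >= threshold else 0.0
            v = v - threshold * s         # soft reset after a spike
            spikes.append(s)
        return spikes

    events = np.random.binomial(1, 0.3, size=20)   # a toy event stream
    print(lif(events))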

Task-Based Programming for Seismic Imaging: Preliminary Results

Lionel Boillot, George Bosilca, Emmanuel Agullo, Henri Calandra
2014 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)  
The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms  ...  The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards.  ...
doi:10.1109/hpcc.2014.205 dblp:conf/hpcc/BoillotBAC14 fatcat:qi23cnnewndmlj6idxnw7hpxta
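
In a task-based paradigm the programmer expresses the work as tasks plus dependencies and lets a runtime map them onto whatever hardware is available. The toy sketch below uses Python's thread pool only to convey the idea; real HPC task runtimes such as the one used for the seismic kernels build a full dependency graph and schedule across heterogeneous devices:

    from concurrent.futures import ThreadPoolExecutor

    def stencil_step(block):
        # Stand-in for a seismic wave-propagation kernel on one data block.
        return [v * 0.5 for v in block]

    blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(stencil_step, b) for b in blocks]   # independent tasks
        results = [f.result() for f in futures]                    # synchronization point
    print(results)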

InSight: An FPGA-Based Neuromorphic Computing System for Deep Neural Networks

Taeyang Hong, Yongshin Kang, Jaeyong Chung
2020 Journal of Low Power Electronics and Applications  
The computing system consists of a non-conventional compiler, a neuromorphic hardware architecture, and a space-efficient microarchitecture that leverages existing integrated circuit design methodologies  ...  This paper describes a neuromorphic computing system that is designed from the ground up for energy-efficient evaluation of deep neural networks.  ...  For each convolutional layer, we perform tensor factorization. For each fully-connected layer, we perform matrix factorization (Step 1). Both are based on singular value decomposition (SVD).  ... 
doi:10.3390/jlpea10040036 fatcat:qgl2htptxrfdpoyuufsnqm6obq
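
The SVD-based matrix factorization of a fully-connected layer mentioned in the snippet replaces one weight matrix W with two thinner layers, cutting the number of connections when the retained rank is small. The sizes and rank below are illustrative only:

    import numpy as np

    W = np.random.randn(256, 512)                  # original fully-connected weights
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = 32                                         # retained rank
    W1 = Vt[:r, :]                                 # first sublayer: 512 -> 32
    W2 = U[:, :r] * s[:r]                          # second sublayer: 32 -> 256
    err = np.linalg.norm(W2 @ W1 - W) / np.linalg.norm(W)
    print(err)                                     # reconstruction error of the rank-32 approximation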
Showing results 1 — 15 out of 6,721 results