
Caffe Barista: Brewing Caffe with FPGAs in the Training Loop [article]

Diederik Adriaan Vink, Aditya Rajagopal, Stylianos I. Venieris, Christos-Savvas Bouganis
2020 arXiv   pre-print
On the other hand, training is both more compute- and memory-intensive and is primarily performed on power-hungry GPUs in large-scale data centres.  ...  This is primarily due to the lack of tools to easily prototype and deploy various hardware and/or algorithmic techniques for power-efficient CNN training.  ...  The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) is gratefully acknowledged.  ... 
arXiv:2006.13829v1 fatcat:p7y4nindu5csfhvtlc3pk6fvpq

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight [article]

Jiarui Fang, Liandeng Li, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, Guangwen Yang
2019 arXiv   pre-print
This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural networks (DNNs) training on Sunway TaihuLight, the current fastest supercomputer in the world  ...  Finally, we present the scalability of swCaffe for the training of ResNet-50 and AlexNet on the scale of 1024 nodes.  ...  Table II: Combination of explicit and implicit GEMM transformation on one CG for convolutional layer in VGG-16 with batch size = 128.  ...  Comparison of GPU and SW26010 in  ... 
arXiv:1903.06934v1 fatcat:m5dbajx3urhyzoy6t5yf4dwwg4

Flexible Performant GEMM Kernels on GPUs [article]

Thomas Faingnaert, Tim Besard, Bjorn De Sutter
2021 arXiv   pre-print
General Matrix Multiplication or GEMM kernels take centre place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores.  ...  The interfaces and abstractions are co-designed for researchers' needs and Julia's features to achieve sufficient separation of concerns and flexibility to easily extend basic GEMMs in many different ways  ...  So far, we focused on flexible GEMMs for CUDA-enabled GPUs.  ... 
arXiv:2009.12263v4 fatcat:illumnxadvhsxln34o7aqaaa2y

TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs [article]

Yuke Wang, Boyuan Feng, Yufei Ding
2021 arXiv   pre-print
We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the PyTorch framework for ease of programming.  ...  To this end, we propose TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration framework. The core idea is to reconcile the "Sparse" GNN computation with "Dense" TCU.  ...  [2] process the batched small-size GEMM on TCU for acceleration. Boyuan et al. [8] introduce GEMM-based scientific computing on TCU with extended precision and high performance.  ... 
arXiv:2112.02052v1 fatcat:vwexjclt5re7doipvdosjsi7ly

On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures

Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, Jack Dongarra
2016 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)  
This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs).  ...  We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU.  ...  CSR 1514286, NVIDIA, the Department of Energy, and in part by the Russian Scientific Foundation, Agreement N14-11-00190.  ... 
doi:10.1109/ipdpsw.2016.190 dblp:conf/ipps/AbdelfattahHTD16 fatcat:m4e72q2lbfhm5m5y5x6mu2zucq

H2OPUS-TLR: High Performance Tile Low Rank Symmetric Factorizations using Adaptive Randomized Approximation [article]

Wajih Boukaram and Stefano Zampini and George Turkiyyah and David Keyes
2021 arXiv   pre-print
In this work, we develop a dynamic batching operation and combine it with batched adaptive randomized approximations to achieve high performance both on GPUs and CPUs.  ...  Our implementation attains over 1.2 TFLOP/s in double precision on the V100 GPU, and is limited by the performance of batched GEMM operations.  ...  In order to efficiently utilize processing cores, especially on GPUs, compressions have to be batched.  ... 
arXiv:2108.11932v1 fatcat:z7jndic5szfoblqrchpo2pm5my

SPOTS: An Accelerator for Sparse Convolutional Networks Leveraging Systolic General Matrix-Matrix Multiplication [article]

Mohammadreza Soltaniyeh, Richard P. Martin, Santosh Nagarakatte
2021 arXiv   pre-print
Our prototype, SPOTS, is on average 1.74X faster than Eyeriss. It is also 78X, and 12X more energy-efficient when compared to CPU and GPU implementations, respectively.  ...  We propose a novel design for the IM2COL unit that uses a set of distributed local memories connected by a ring network, which improves energy efficiency and latency by streaming the input feature map  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.  ... 
arXiv:2107.13386v2 fatcat:k7oampka5rdztojmmwrr2yvnfm


Subhankar Pal, Siying Feng, Dong-hyeon Park, Sung Kim, Aporva Amarnath, Chi-Sheng Yang, Xin He, Jonathan Beaumont, Kyle May, Yan Xiong, Kuba Kaszyk, John Magnus Morton (+8 others)
2020 Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques  
Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications  ...  Our evaluations with Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels  ...  Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.  ... 
doi:10.1145/3410463.3414627 dblp:conf/IEEEpact/PalFPKAYHBMXKMS20 fatcat:kwsaun2g65b6jl6mdqrhgiv7yq

Morphling: A Reconfigurable Architecture for Tensor Computation

Liqiang Lu, Yun Liang
2021 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  
Furthermore, to efficiently support sparse tensors, we design a tiled-BCSR format that enables high parallelism and balanced workload.  ...  Overall, Morphling achieves 13.4X, 677.7X, 44.7X energy efficiency over Xilinx ZC706 FPGA, Intel i7-9700K CPU, and NVIDIA TitanX GPU.  ...  To achieve a balanced workload for each PE, the blocks are batched into multiple tiles, with each tile containing multiple continuous rows, as shown in Figure 3(b).  ... 
doi:10.1109/tcad.2021.3135322 fatcat:5omvjoxy3zd7jgaear5swwjlou

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding [article]

Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry
2022 arXiv   pre-print
This paper presents CoRa, a tensor compiler that allows users to easily generate efficient code for ragged tensor operators targeting a wide range of CPUs and GPUs.  ...  Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform  ...  We now look at the execution time breakdown for the CoLA dataset at batch size 32 on the Nvidia GPU shown in Fig. 24. We see that CoRa performs slightly worse than FT-Eff for this case.  ... 
arXiv:2110.10221v3 fatcat:fzb2fcizkvbsvlbihab2ceagsi

PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units [article]

Yujeong Choi, Minsoo Rhu
2019 arXiv   pre-print
This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput  ...  To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests.  ...  GEMM_OP: performs a matrix multiplication between the weight tile (SW×SH) and input activation tile (SH×ACC) using the GEMM unit, generating the output activation tile (SW×ACC) that is stored into  ... 
arXiv:1909.04548v1 fatcat:mwsbnwmt6bcpxnozcjp56gjhtm
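
The tile shapes quoted for GEMM_OP in the snippet above can be illustrated with a minimal pure-Python sketch. The function name `gemm_op` and the sample values are hypothetical and chosen only for illustration; the sketch mirrors the shapes (SW×SH times SH×ACC gives SW×ACC), not the NPU's actual GEMM unit or its accumulator storage.

```python
def gemm_op(weight_tile, input_tile):
    """Multiply an (SW x SH) weight tile by an (SH x ACC) input activation tile.

    Returns the (SW x ACC) output activation tile as a list of lists.
    Hypothetical helper: only the shapes follow the quoted description.
    """
    sw, sh = len(weight_tile), len(weight_tile[0])
    if len(input_tile) != sh:
        raise ValueError("inner dimensions (SH) must match")
    acc = len(input_tile[0])
    out = [[0] * acc for _ in range(sw)]
    for i in range(sw):
        for k in range(sh):
            w = weight_tile[i][k]
            for j in range(acc):
                out[i][j] += w * input_tile[k][j]
    return out


# Illustrative values only: SW = 2, SH = 3, ACC = 4.
weights = [[1, 2, 3],
           [4, 5, 6]]
acts = [[1, 0, 2, 1],
        [0, 1, 1, 0],
        [2, 1, 0, 1]]
output_tile = gemm_op(weights, acts)  # 2 x 4 output activation tile
```
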

Making Convolutions Resilient via Algorithm-Based Error Detection Techniques [article]

Siva Kumar Sastry Hari, Michael B. Sullivan, Timothy Tsai, Stephen W. Keckler
2020 arXiv   pre-print
Algorithmic techniques are known to offer low-cost solutions, but the practical feasibility and performance of such techniques have never been studied for CNN deployment platforms (e.g., TensorFlow or TensorRT on GPUs).  ...  Since cuDNN uses GEMM as a method to perform the int8 convolutions and GEMMs use tiling, the sharp increase in the runtime is likely due to the use of an additional tile.  ... 
arXiv:2006.04984v1 fatcat:tvb3azqlrvbxripqnvbobstnii

Linear algebra software for large-scale accelerated multicore computing

A. Abdelfattah, H. Anzt, J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki, A. YarKhan
2016 Acta Numerica  
The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight  ...  To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks.  ...  Finally, all the tiles to the right of the panel (the trailing submatrix) are updated, using a symmetric rank-k update (SYRK) for diagonal tiles A(m,m) and matrix multiplication (GEMM) on the other tiles  ... 
doi:10.1017/s0962492916000015 fatcat:cwsstweghjaj7ff6fu62lmn6ce
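
The trailing-submatrix update quoted above (SYRK on diagonal tiles A(m,m), GEMM on the other tiles) can be sketched in pure Python. This sketch assumes the surrounding algorithm is a right-looking tile Cholesky factorization, which the snippet does not state; the panel steps (POTRF, TRSM), the function name `tile_cholesky`, and the sample matrix are assumptions added for illustration, and the code runs sequentially rather than as scheduled tasks.

```python
import math


def tile_cholesky(a, nb):
    """Return the lower Cholesky factor of SPD matrix `a` using nb x nb tiles."""
    n = len(a)
    if n % nb != 0:
        raise ValueError("matrix size must be a multiple of the tile size")
    a = [list(row) for row in a]   # work on a copy
    nt = n // nb                   # number of tile rows/columns

    def span(t):
        return range(t * nb, (t + 1) * nb)

    def residual(i, k, j):
        # a[i][j] minus the dot product over already-factored columns of tile column k
        return a[i][j] - sum(a[i][p] * a[j][p] for p in range(k * nb, j))

    for k in range(nt):
        # POTRF (assumed step): factor the diagonal tile A(k,k)
        for j in span(k):
            a[j][j] = math.sqrt(residual(j, k, j))
            for i in range(j + 1, (k + 1) * nb):
                a[i][j] = residual(i, k, j) / a[j][j]
        # TRSM (assumed step): solve the panel tiles A(m,k) below the diagonal
        for m in range(k + 1, nt):
            for i in span(m):
                for j in span(k):
                    a[i][j] = residual(i, k, j) / a[j][j]
        # Trailing update: SYRK on diagonal tiles A(m,m), GEMM on the other tiles
        for m in range(k + 1, nt):
            for c in range(k + 1, m + 1):
                for i in span(m):
                    for j in span(c):
                        if c == m and j > i:
                            continue  # SYRK only touches the lower triangle
                        a[i][j] -= sum(a[i][p] * a[j][p] for p in span(k))
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


# Illustrative 4 x 4 SPD matrix, factored with 2 x 2 tiles.
spd = [[4, 2, 0, 2],
       [2, 10, 3, 1],
       [0, 3, 5, 2],
       [2, 1, 2, 3]]
factor = tile_cholesky(spd, 2)
```

With two tile rows only the SYRK branch is exercised; a tile size of 1 on the same matrix also exercises the GEMM branch.
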

Fast Algorithms for Convolutional Neural Networks

Andrew Lavin, Scott Gray
2016 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)  
We benchmark a GPU implementation of our algorithm with the VGG network and show state of the art throughput at batch sizes from 1 to 64.  ...  The algorithms compute minimal complexity convolution over small tiles, which makes them fast with small filters and small batch sizes.  ...  Substituting $U = G g G^T$ and $V = B^T d B$, we have: $Y = A^T [U \odot V] A$ (9). Labeling tile coordinates as (x, y), we rewrite the convnet layer formula (2) for a single image i, filter k, and tile coordinate (  ... 
doi:10.1109/cvpr.2016.435 dblp:conf/cvpr/LavinG16 fatcat:o45ib3sf7jbr3kyx4rffxhnnjq
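
The tile formula quoted above, with U = G g G^T and V = B^T d B combined elementwise and transformed by A, can be checked numerically for one tile. The sketch below assumes the F(2×2, 3×3) case (a 4×4 input tile d, a 3×3 filter g, a 2×2 output) and the standard transform matrices for that case; the helper names and sample values are hypothetical, and it is a single-tile illustration, not the GPU implementation.

```python
def matmul(x, y):
    """Plain list-of-lists matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(row) for row in zip(*x)]


# Transform matrices for F(2x2, 3x3) (assumed case for this sketch).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1, 0, 0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0, 0, 1]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def winograd_tile(d, g):
    """Compute Y = A^T [(G g G^T) * (B^T d B)] A for one 4x4 tile, * elementwise."""
    u = matmul(matmul(G, g), transpose(G))     # transformed filter, 4x4
    v = matmul(matmul(BT, d), transpose(BT))   # transformed input tile, 4x4
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return matmul(matmul(AT, m), transpose(AT))  # 2x2 output tile


def direct_tile(d, g):
    """Reference: direct 3x3 correlation producing the same 2x2 output (36 multiplies)."""
    return [[sum(d[i + r][j + s] * g[r][s] for r in range(3) for s in range(3))
             for j in range(2)] for i in range(2)]


# Illustrative values only.
d = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
g = [[1, 0, -1],
     [2, 0, -2],
     [1, 0, -1]]
y_fast = winograd_tile(d, g)
y_ref = direct_tile(d, g)
```

Comparing `y_fast` with `y_ref` confirms that the transformed computation reproduces the direct result for this tile while using 16 elementwise multiplications instead of 36.
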

TC-CIM: Empowering Tensor Comprehensions for Computation in Memory

Andi Drebes, Lorenzo Chelini, Oleksandr Zinenko, Albert Cohen, Henk Corporaal, Tobias Grosser, Kanishkan Vadivel, Nicolas Vasilache
2020 Zenodo  
A major challenge for the programmability and exploitation of such Computing-In-Memory (CIM) architectures consists in the efficient mapping of tensor operations from high-level ML frameworks to fixed-function  ...  Operations suitable for acceleration are identified using Loop Tactics, a declarative framework to describe computational patterns in a polyhedral representation.  ...  as through Polly Labs (Xilinx Inc, Facebook Inc, and ARM Holdings) and the Swiss National Science Foundation through the Ambizione program.  ... 
doi:10.5281/zenodo.3736308 fatcat:k4yrzeo4jjdwvdrizc77poxblq