Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs

Pai-Wei Lai, Humayun Arafat, Venmugil Elango, P. Sadayappan
2013 20th Annual International Conference on High Performance Computing  
Implementation of Strassen's matrix multiplication algorithm for arbitrary size matrices. Empirically driven cost model for choosing cut-off for switching from Strassen's to standard matrix multiplication  ...  algorithm $\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$ Block decomposition of matrix multiplication. Implementation of Strassen's algorithm. cannot fully exploit benefits  ... 
doi:10.1109/hipc.2013.6799109 dblp:conf/hipc/LaiAES13 fatcat:luj5o3ju5zafrjfzykrvqsbzku
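
The block decomposition and cut-off idea mentioned in this entry can be illustrated with a minimal pure-Python sketch of the Winograd variant of Strassen's algorithm (7 block multiplications and 15 block additions per level). This is not the paper's GPU implementation: the function names, the list-of-lists representation, and the default `cutoff` value are assumptions made for illustration, and odd sizes simply fall back to the classical product rather than being padded or peeled.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def classical(A, B):
    """Standard O(n^3) matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen_winograd(A, B, cutoff=2):
    """Multiply square matrices A and B (hypothetical helper).

    Below the cutoff, or when the size is odd, switch to the classical product.
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # 8 pre-additions
    S1 = add(A21, A22); S2 = sub(S1, A11); S3 = sub(A11, A21); S4 = sub(A12, S2)
    T1 = sub(B12, B11); T2 = sub(B22, T1); T3 = sub(B22, B12); T4 = sub(T2, B21)
    # 7 recursive multiplications
    def mul(X, Y):
        return strassen_winograd(X, Y, cutoff)
    P1 = mul(A11, B11); P2 = mul(A12, B21); P3 = mul(S4, B22); P4 = mul(A22, T4)
    P5 = mul(S1, T1); P6 = mul(S2, T2); P7 = mul(S3, T3)
    # 7 post-additions
    U2 = add(P1, P6); U3 = add(U2, P7); U4 = add(U2, P5)
    C11 = add(P1, P2); C12 = add(U4, P3); C21 = sub(U3, P4); C22 = add(U3, P5)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

In practice the cut-off would be chosen empirically per platform, as the snippet above suggests; the value 2 here only keeps the example small.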

Author index

2013 20th Annual International Conference on High Performance Computing  
and Enhancement of Weather Application Performance on Blue Gene/Q Sadayappan, Ponnuswamy Accelerating Strassen-Winograd's Matrix Multiplication Algorithm on GPUs Saxena, Vaibhav Evaluation and Enhancement  ...  Strassen-Winograd's Matrix Multiplication Algorithm on GPUs Fensch, Christian MaSiF: Machine Learning Guided Auto-tuning of Parallel Skeletons Fu, Songling iFlatLFS: Performance Optimization for Accessing  ... 
doi:10.1109/hipc.2013.6799145 fatcat:jnmign7535ep5jvhn65xp4aqqe

A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU+ Fast Matrix Multiply [article]

Paolo D'Alberto
2012 arXiv   pre-print
These APU processors provide multiple symmetric cores with their memory hierarchies and an integrated GPU.  ...  We present a case study for the development of dense Matrix Multiplication (MM) codes for matrix sizes up to 19K × 19K, thus using all of the above computational engines, and an achievable peak performance  ...  Lastly, we thank Matthew Badin, Alexandru Nicolau, Michael Dillencourt for the conversations about GPUs.  ... 
arXiv:1205.2927v1 fatcat:uxxt2nhw2ffqrbltzbhn2jebxa

Convolution Accelerator Designs Using Fast Algorithms

Yulin Zhao, Donghui Wang, Leiou Wang
2019 Algorithms  
The implementation results show that the power consumption of the accelerator design based on the Strassen–Winograd algorithm is 21.3% less than that of conventional accelerators.  ...  To overcome these difficulties, this paper proposes several convolution accelerator designs using fast algorithms.  ...  Winograd's variant of the Strassen algorithm only needs 15 additions [23], and it achieves relatively good results on GPUs [24].  ... 
doi:10.3390/a12050112 fatcat:sb276imvbvglree2dqwcnptuga

CENNA: Cost-Effective Neural Network Accelerator

Sang-Soo Park, Ki-Seok Chung
2020 Electronics  
In this study, we propose a cost-effective neural network accelerator, named CENNA, whose hardware cost is reduced by employing a cost-centric matrix multiplication that employs both Strassen's multiplication  ...  Furthermore, the convolution method using the proposed matrix multiplication can minimize data movement by reusing both the feature map and the convolution kernel without any additional control logic.  ...  A GPU can accelerate CNNs quickly, but in a battery-powered embedded system, relying heavily on GPUs may lead to an unacceptably large amount of energy dissipation [8].  ... 
doi:10.3390/electronics9010134 fatcat:4neuwhwhn5fqlmeb5ip3d6bbei

Deep Tensor Convolution on Multicores [article]

David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, Nir Shavit
2017 arXiv   pre-print
These networks have improved performance of video and volumetric image analysis, but have been limited in size due to the low memory ceiling of GPU hardware.  ...  Second, we maximize CPU utilization and multicore scalability by transforming data matrices to be cache-aware, integer multiples of AVX vector widths.  ...  More generally, one could also apply the Strassen algorithm to reduce the number of steps required for matrix multiplication (Cong & Xiao, 2014).  ... 
arXiv:1611.06565v3 fatcat:ouzr3bssdnftxe6zz5nmxdow7e

Effective and High Computing Algorithms for Convolution Neural Networks

P Syamala Rao, Dr G.P.SaradhiVarma, Rajasekhar Mutukuri
2018 International Journal of Engineering & Technology  
Using Winograd's minimal filtering algorithms, we introduce a new class of fast algorithms for convolutional neural networks.  ...  With the VGG network, we benchmark a GPU implementation of our algorithm and show state-of-the-art throughput at batch sizes from 1 to 64.  ...  To reduce the convolutions in a convnet layer, Cong and Xiao [7] used the Strassen algorithm for fast matrix multiplication, which reduced the arithmetic complexity.  ... 
doi:10.14419/ijet.v7i3.31.18203 fatcat:24m3ebi5dbccjg5l7j6pdb6e7q

Fast Feasible and Unfeasible Matrix Multiplication [article]

Victor Y. Pan
2018 arXiv   pre-print
Fast matrix-by-matrix multiplication (hereafter MM) is a highly recognized research subject.  ...  We first survey the mainstream study of the acceleration of MM of unbounded sizes, cover the progress in decreasing the exponents of MM, comment on its impact on the theory and practice of computing, and  ...  (so far in practice they are mostly recursive bilinear algorithm for MM based on Winograd's 2 × 2 MM of Example 2.3 or less frequently based on Strassen's 2 × 2 MM of 2.2, but also Kaporin's algorithms  ... 
arXiv:1804.04102v1 fatcat:tki5pawehbgorhtcvwtdegkgee

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis [article]

Tal Ben-Nun, Torsten Hoefler
2018 arXiv   pre-print
Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design.  ...  Based on those approaches, we extrapolate potential directions for parallelism in deep learning.  ...  as in Winograd's algorithm (implementation in Appendix C).  ... 
arXiv:1802.09941v2 fatcat:ne2wiplln5eavjvjwf5to7nwsu

Multi-Component Optimization and Efficient Deployment of Neural-Networks on Resource-Constrained IoT Hardware [article]

Bharath Sudharsan, Dineshkumar Sundaram, Pankesh Patel, John G. Breslin, Muhammad Intizar Ali, Schahram Dustdar, Albert Zomaya, Rajiv Ranjan
2022 arXiv   pre-print
that can comfortably fit and execute on resource-constrained hardware.  ...  Researchers and developers can use our optimization sequence to optimize high memory, computation demanding models in multiple aspects in order to produce small size, low latency, low-power consuming models  ...  Analyze the linear algebraic properties [28] of a NN model and apply algorithms such as Strassen Gaussian elimination, Winograd's minimal filtering [29] to reduce the computational workload, resulting  ... 
arXiv:2204.10183v1 fatcat:7yelkcwgdvcg5n4t4tmwymsln4

High Performance and Portable Convolution Operators for ARM-based Multicore Processors [article]

Pablo San Juan, Adrián Castelló, Manuel F. Dolz, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí
2020 arXiv   pre-print
One of these approaches leverages the im2col transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear  ...  The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in  ...  In some cases, the gemm-based approach can be accelerated employing Winograd's minimal filtering algorithms, possibly combined with the Strassen variant of the matrix multiplication [25, 39].  ... 
arXiv:2005.06410v1 fatcat:omytbc6xbfasvaz3n3sco4nbra
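
The "transform followed by a GEMM" approach named in this entry is commonly known as im2col. A minimal pure-Python sketch of the idea follows, assuming a single-channel image, stride 1, no padding, and cross-correlation semantics (no kernel flip). The function names and data layout are hypothetical and are not taken from the paper.

```python
def im2col(image, kh, kw):
    """Return one row per output position, holding the flattened kh x kw patch."""
    height, width = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(height - kh + 1)
            for j in range(width - kw + 1)]

def gemm(A, B):
    """Plain matrix product standing in for an optimized GEMM kernel."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def conv2d_im2col(image, kernels):
    """Apply several equally sized kernels to one image via im2col + GEMM."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)                      # (out_h*out_w) x (kh*kw)
    # Filter matrix: one column per kernel, rows in the same order as the patches.
    filters = [[k[di][dj] for k in kernels]
               for di in range(kh) for dj in range(kw)]  # (kh*kw) x num_kernels
    flat = gemm(patches, filters)                        # (out_h*out_w) x num_kernels
    # Reshape each GEMM output column back into an out_h x out_w feature map.
    return [[[flat[r * out_w + c][f] for c in range(out_w)] for r in range(out_h)]
            for f in range(len(kernels))]
```

The patch matrix duplicates overlapping pixels, which trades extra memory for the ability to reuse a single tuned matrix-multiplication routine.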

Communication lower bounds and optimal algorithms for numerical linear algebra

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, O. Schwartz
2014 Acta Numerica  
Some of these generalize known lower bounds for dense classical (O(n^3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices  ...  First we describe lower bounds on communication.  ...  As an alternative, one can perform matrix equilibration. For nonsymmetric matrices, this involves applying diagonal row and column  ... 
doi:10.1017/s0962492914000038 fatcat:43lzwu73vzbk3dvlq3zk5gydfy

Performance engineering of data-intensive applications

Arya Mazaheri
We start with performance profiling to gain insights on thread communications and relevant code optimizations.  ...  Such a requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities while porting the  ...  Thus, we also aim to accelerate such layers via optimized algorithms and obtain reasonably high performance on a wide variety of GPUs.  ... 
doi:10.26083/tuprints-00021078 fatcat:nskwgb2vxvew7egtrzwlzebiri