150 Hits in 4.3 sec

On the Inversion of Multiple Matrices on GPU in Batched Mode

2018 Supercomputing Frontiers and Innovations  
In [1] the authors give the design and implementation of batched matrix-matrix multiplication on GPUs.  ...  There is no communication between systems (1), so we can analyse performance on a single GPU device and assume linear scaling of the problem for multiple GPUs.  ...  Acknowledgments This work is supported by RFBR grant no. 17-07-00116 and by subprogram 0063-2016-0018 of the program III.3 ONIT RAS.  ...
doi:10.14529/jsfi180203 fatcat:3vzug7idpfd75e5ksck5kbkhwy
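The snippet above describes batched mode: many small, mutually independent matrix products computed in one call. A minimal CPU sketch of those semantics, using NumPy broadcasting rather than the GPU routines the paper implements (sizes are illustrative):

```python
import numpy as np

# A batch of 150 independent 4x4 matrix pairs; batched mode computes
# all products C[i] = A[i] @ B[i] in one call, with no communication
# between the individual problems.
rng = np.random.default_rng(0)
A = rng.standard_normal((150, 4, 4))
B = rng.standard_normal((150, 4, 4))

C = np.matmul(A, B)  # broadcasts over the leading batch dimension

assert C.shape == (150, 4, 4)
assert np.allclose(C[7], A[7] @ B[7])  # each product is independent
```

Because the problems are independent, the same batch can be split across devices, which is the basis for the linear multi-GPU scaling assumption quoted above.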

Parallel Power Flow Computation Trends and Applications: A Review Focusing on GPU

Dong-Hee Yoon, Youngsun Han
2020 Energies  
with GPU.  ...  A power flow study aims to analyze a power system by obtaining the voltage and phase angle of buses inside the power system.  ...  QR Decomposition LU decomposition is more commonly used to solve power flow problems, but it is also possible to adopt QR decomposition as an alternative for leveraging GPU-based parallel processing.  ... 
doi:10.3390/en13092147 fatcat:ofic3va3rnb6fjacse3yw2jbzm
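The snippet notes QR decomposition as an alternative to LU for power-flow linear solves. A small CPU sketch of that substitution (NumPy, dense, illustrative; a production power-flow solver would work on sparse systems):

```python
import numpy as np

# Solving A x = b via QR instead of LU: factor A = Q R, then
# x = R^{-1} (Q^T b). QR needs no pivoting, which can make it easier
# to parallelize on a GPU than partially pivoted LU.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)  # R is upper triangular

assert np.allclose(A @ x, b)
```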

Towards batched linear solvers on accelerated hardware platforms

Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stanimire Tomov, Jack Dongarra
2015 Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP 2015  
Compared to a batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.  ...  In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel.  ...  Parallel Swapping on GPUs Profiling the batched LU reveals that more than 60% of the time is spent in the swapping routine.  ...
doi:10.1145/2688500.2688534 dblp:conf/ppopp/HaidarDLTD15 fatcat:ie4m7pqjzjg7naiy5an3mtjl3u
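The "swapping routine" profiled in the snippet above is the row-interchange step of partially pivoted LU (LAPACK's laswp). One reason it can be parallelized is that a sequence of swaps can be composed into a single permutation and applied as one gather. A CPU sketch of that equivalence (NumPy, illustrative):

```python
import numpy as np

def apply_pivots_sequential(A, ipiv):
    """LAPACK-style pivoting: swap row k with row ipiv[k], one at a time."""
    A = A.copy()
    for k, p in enumerate(ipiv):
        A[[k, p]] = A[[p, k]]
    return A

def apply_pivots_gather(A, ipiv):
    """Compose the swaps into one permutation, then move all rows at once."""
    perm = np.arange(A.shape[0])
    for k, p in enumerate(ipiv):
        perm[[k, p]] = perm[[p, k]]
    return A[perm]  # a single gather over rows

A = np.arange(16.0).reshape(4, 4)
ipiv = [2, 3, 3, 3]  # an example swap sequence from a pivoted factorization
assert np.allclose(apply_pivots_sequential(A, ipiv),
                   apply_pivots_gather(A, ipiv))
```

The gather form maps naturally to one GPU kernel launch per batch instead of one dependent swap per pivot.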

Towards batched linear solvers on accelerated hardware platforms

Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stanimire Tomov, Jack Dongarra
2015 SIGPLAN notices  
Compared to a batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to a 2.5-fold speedup on the K40 GPU.  ...  In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel.  ...  Parallel Swapping on GPUs Profiling the batched LU reveals that more than 60% of the time is spent in the swapping routine.  ...
doi:10.1145/2858788.2688534 fatcat:gpbwou4l7renlhjv3merfqwj54

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations [chapter]

Azzam Haidar, Tingxing Tim Dong, Stanimire Tomov, Piotr Luszczek, Jack Dongarra
2015 Lecture Notes in Computer Science  
Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5× speedup on the K GPU. Historically, similar issues were associated with strong scaling [ ] and were  ...  But in order to benefit from the GPU's significantly higher energy efficiency, the primary design goal is to avoid the use of the multicore CPU and to exclusively rely on the GPU.  ...  ACI-, the Department of Energy, and Intel. The results were obtained in part with the financial support of the Russian Scientific Fund,  ...
doi:10.1007/978-3-319-20119-1_3 fatcat:hgupli7firaqhebgiuojf35xwq

Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression [article]

Wajih Halim Boukaram, George Turkiyyah, Hatem Ltaief, David E. Keyes
2017 arXiv   pre-print
We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices.  ...  The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.  ...  Acknowledgments We thank the NVIDIA Corporation for providing access to the P100 GPU used in this work.  ... 
arXiv:1707.05141v1 fatcat:2zteuq5hsje4xk6in2qhon5eju

MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing

Azzam Haidar, Stanimire Tomov, Piotr Luszczek, Jack Dongarra
2015 2015 IEEE High Performance Extreme Computing Conference (HPEC)  
and energy efficiency, on the Jetson TK1 development kit.  ...  We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance  ...  ACKNOWLEDGEMENTS This material is based upon work supported by the National Science Foundation under Grant ACI-1339822, the Department of Energy, and NVIDIA.  ... 
doi:10.1109/hpec.2015.7322444 dblp:conf/hpec/HaidarTLD15 fatcat:tvc75miwr5cjjh2ls4fwdet2fe

Linear algebra software for large-scale accelerated multicore computing

A. Abdelfattah, H. Anzt, J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, I. Yamazaki, A. YarKhan
2016 Acta Numerica  
Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value  ...  decomposition (SVD) problems.  ...  The solid curve shows the power consumption of our GPU implementation of the batched QR decomposition.  ... 
doi:10.1017/s0962492916000015 fatcat:cwsstweghjaj7ff6fu62lmn6ce

Investigation of the performance of LU decomposition method using CUDA

Caner Ozcan, Baha Sen
2012 Procedia Technology - Elsevier  
In this study, a Graphics Processing Unit (GPU)-accelerated implementation of the LU linear algebra routine is presented. LU decomposition is a decomposition of the form A=LU, where A is a square matrix.  ...  The main idea of the LU decomposition is to record the steps used in Gaussian elimination on A in the places where the zeros are produced. L and U are lower and upper triangular matrices, respectively.  ...  In one study, the fastest implementations of dense LU, QR and Cholesky factorizations running on one or two NVIDIA GPUs are presented [3].  ...
doi:10.1016/j.protcy.2012.02.011 fatcat:jaokyfnzlnbrngpgronwgie7zm
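The snippet above describes LU decomposition as recording the Gaussian-elimination multipliers in the positions where the zeros are produced. A minimal sketch of exactly that idea (Doolittle form, no pivoting, so only valid when every pivot is nonzero; illustrative, not the paper's CUDA code):

```python
import numpy as np

def lu_nopivot(A):
    """Factor A = L @ U by Gaussian elimination, storing each
    elimination multiplier in L where U receives its zero."""
    U = A.astype(float).copy()
    n = U.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = U[i, k] / U[k, k]   # multiplier that zeroes U[i, k]
            L[i, k] = m             # record the elimination step in L
            U[i, k:] -= m * U[k, k:]
    return L, np.triu(U)

A = np.array([[4.0, 3.0, 2.0],
              [2.0, 4.0, 1.0],
              [1.0, 1.0, 3.0]])
L, U = lu_nopivot(A)
assert np.allclose(L @ U, A)                 # A = LU recovered exactly
assert np.allclose(L, np.tril(L))            # L is lower triangular
assert np.allclose(U, np.triu(U))            # U is upper triangular
```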

GPU-based N-1 Static Security Analysis Algorithm with Preconditioned Conjugate Gradient Method

Meng Fu, Gan Zhou, Jiahao Zhao, Yanjun Feng, Huan He, Kai Liang
2020 IEEE Access  
Case studies on a practical 10828-bus system show that the GPU-based N-1 SSA algorithm with the batch-PCG solver is 4.90 times faster than a sequential algorithm on an 8-core CPU.  ...  Second, it proposes a GPU-based batch-PCG solver, which packages a massive number of PCG subtasks into a large-scale problem to achieve a higher degree of parallelism and better coalesced memory accesses  ...  As the rank-one update method solves single SLSE faster than LU decomposition, Solution 2 with multi-threaded Solver 2 is a superior SSA solution on the CPU platform.  ... 
doi:10.1109/access.2020.3004713 fatcat:k6wbjj6elngvfi2r376fudpibq
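The batch-PCG solver described above packages many independent preconditioned conjugate gradient subtasks together. A minimal sketch of one such PCG subtask with a Jacobi (diagonal) preconditioner (pure NumPy, dense, illustrative; the paper batches many of these on the GPU over sparse systems):

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for SPD A, with the
    preconditioner applied as an elementwise scaling (Jacobi)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r          # apply the preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(4)
B = rng.standard_normal((20, 20))
A = B @ B.T + 20 * np.eye(20)        # a well-conditioned SPD system
b = rng.standard_normal(20)
x = pcg(A, b, 1.0 / np.diag(A))
assert np.allclose(A @ x, b, atol=1e-6)
```

Batching amortizes kernel-launch overhead: the matrix-vector products of all subtasks can be fused into one large, coalesced operation, which is the source of the speedup the abstract reports.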

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Jakub Kurzak, Hartwig Anzt, Mark Gates, Jack Dongarra
2016 IEEE Transactions on Parallel and Distributed Systems  
Due to their high processing power, Graphics Processing Units became an attractive target for this class of problems, and routines based on the LU and the QR factorization have been provided by NVIDIA  ...  Due to the lack of a cuBLAS Cholesky factorization, execution rates of cuBLAS LU and cuBLAS QR are used for comparison against the proposed Cholesky factorization in this work.  ...  ACKNOWLEDGMENTS This work is supported by grant #SHF-1320603: "Benchtesting Environment for Automated Software Tuning (BEAST)" from the National Science Foundation, by the Department of Energy grant #DE-SC0010042, and  ... 
doi:10.1109/tpds.2015.2481890 fatcat:uglchqysozci3f6hgey373tphu

LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU

Tingxing Dong, Azzam Haidar, Piotr Luszczek, James Austin Harris, Stanimire Tomov, Jack Dongarra
2014 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)  
Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction  ...  To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches.  ...  ACKNOWLEDGMENT The authors would like to thank the National Science Foundation, the Department of Energy, NVIDIA and MAGMA project support.  ... 
doi:10.1109/hpcc.2014.30 dblp:conf/hpcc/DongHLHTD14 fatcat:2eootml2bjdujlsx25jbiekgle

A Fast Batched Cholesky Factorization on a GPU

Tingxing Dong, Azzam Haidar, Stanimire Tomov, Jack Dongarra
2014 2014 43rd International Conference on Parallel Processing  
In this paper, we proposed a batched Cholesky factorization on a GPU. Three algorithms -nonblocked, blocked, and recursive blocked -were examined.  ...  Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU.  ...  ACKNOWLEDGMENT The authors would like to thank the National Science Foundation, the Department of Energy, NVIDIA and MAGMA project support.  ... 
doi:10.1109/icpp.2014.52 dblp:conf/icpp/DongHTD14 fatcat:gnbqn4mvp5aejetof56wb33yh4
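The semantics of the batched Cholesky factorization discussed above can be sketched on the CPU: factor a stack of small symmetric positive definite matrices A_i = L_i L_i^T in one call. NumPy's `cholesky` broadcasts over a leading batch dimension, which mirrors the batched interface (illustrative; the paper's implementation runs entirely on the GPU):

```python
import numpy as np

# Build a batch of 8 small SPD matrices: B B^T is positive semidefinite,
# and adding a multiple of the identity makes each matrix definite.
rng = np.random.default_rng(3)
B = rng.standard_normal((8, 5, 5))
A = B @ B.transpose(0, 2, 1) + 5.0 * np.eye(5)

L = np.linalg.cholesky(A)  # one batched factorization over all 8 matrices

assert L.shape == (8, 5, 5)
assert np.allclose(L @ L.transpose(0, 2, 1), A)  # A_i = L_i L_i^T
assert np.allclose(L, np.tril(L))                # each factor is lower triangular
```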

CUDA Accelerated Visual Egomotion Estimation for Robotic Navigation

Safa Ouerghi, Remi Boutteau, Xavier Savatier, Fethi Tlili
2017 Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications  
Egomotion estimation is a fundamental issue in structure from motion and autonomous navigation for mobile robots.  ...  Five-point methods, which use the minimal number of correspondences required to estimate the essential matrix, have raised special interest for their application in a hypothesize-and-test framework.  ...  We exploit the batched interface of LU factorization, performing four GPU kernel calls for solving systems of the form (MX = b) as follows: 1. LU decomposition of M (P M = LU). 2.  ...
doi:10.5220/0006171501070114 dblp:conf/visapp/OuerghiBST17 fatcat:ugwfmecbrrcxtoyt4exjor73cm
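The two-step pattern in the snippet above (LU-decompose M with row pivoting, then solve MX = b against the stored factors) can be sketched on the CPU with SciPy's LU interface; this is not the CUDA batched routine the paper uses, just the same mathematical steps:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

lu, piv = lu_factor(M)      # step 1: P M = L U, pivots stored in piv
x = lu_solve((lu, piv), b)  # step 2: forward/backward triangular solves

assert np.allclose(M @ x, b)
```

In the batched setting, the factorization and solve each become one kernel call over all systems in the RANSAC hypothesis set, instead of one call per hypothesis.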

CULA: hybrid GPU accelerated linear algebra routines

John R. Humphrey, Daniel K. Price, Kyle E. Spagnoli, Aaron L. Paolini, Eric J. Kelmelis
2010 Modeling and Simulation for Defense Systems and Applications V  
We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares.  ...  The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance.  ...  eigenproblem solvers (general and symmetric), singular value decompositions, and many useful factorizations (QR, Hessenberg, etc.)  ... 
doi:10.1117/12.850538 fatcat:fx6e5zr6jbhxrdnvjyfim4eaie
Showing results 1–15 out of 150 results