On the Inversion of Multiple Matrices on GPU in Batched Mode
2018
Supercomputing Frontiers and Innovations
In [1], the authors give the design and implementation of batched matrix-matrix multiplication on GPUs. ...
There is no communication between systems (1), so we can analyse performance on a single GPU device and assume linear scaling of the problem for multiple GPUs. ...
Acknowledgments This work is supported by RFBR grant no. 17-07-00116 and by subprogram 0063-2016-0018 of the program III.3 ONIT RAS. ...
doi:10.14529/jsfi180203
fatcat:3vzug7idpfd75e5ksck5kbkhwy
Parallel Power Flow Computation Trends and Applications: A Review Focusing on GPU
2020
Energies
with GPU. ...
A power flow study aims to analyze a power system by obtaining the voltage and phase angle of buses inside the power system. ...
QR Decomposition LU decomposition is more commonly used to solve power flow problems, but it is also possible to adopt QR decomposition as an alternative for leveraging GPU-based parallel processing. ...
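The QR-based direct solve mentioned in this abstract can be sketched in a few lines (an illustrative NumPy example with made-up data, not code from the paper):

```python
import numpy as np

# Solve A x = b via QR instead of LU: factor A = Q R with Q orthogonal
# and R upper triangular, then back-substitute on R x = Q^T b.
# A and b are arbitrary example data, not a power flow case.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ b)  # R is triangular, so this is a back-substitution
```

Unlike LU, QR is numerically stable without pivoting, which is one reason it maps well to GPU-parallel processing.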
doi:10.3390/en13092147
fatcat:ofic3va3rnb6fjacse3yw2jbzm
Towards batched linear solvers on accelerated hardware platforms
2015
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP 2015
Compared to a batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU. ...
In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for sets of small dense matrices to work in parallel. ...
Parallel Swapping on GPUs Profiling the batched LU reveals that more than 60% of the time is spent in the swapping routine. ...
doi:10.1145/2688500.2688534
dblp:conf/ppopp/HaidarDLTD15
fatcat:ie4m7pqjzjg7naiy5an3mtjl3u
Towards batched linear solvers on accelerated hardware platforms
2015
SIGPLAN notices
Compared to a batched LU factorization featured in NVIDIA's CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU. ...
In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for sets of small dense matrices to work in parallel. ...
Parallel Swapping on GPUs Profiling the batched LU reveals that more than 60% of the time is spent in the swapping routine. ...
doi:10.1145/2858788.2688534
fatcat:gpbwou4l7renlhjv3merfqwj54
A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations
[chapter]
2015
Lecture Notes in Computer Science
Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5× speedup on the K40 GPU. Historically, similar issues were associated with strong scaling [ ] and were ...
But in order to benefit from the GPU's significantly higher energy efficiency, the primary design goal is to avoid the use of the multicore CPU and to exclusively rely on the GPU. ...
ACI-, the Department of Energy, and Intel. The results were obtained in part with the financial support of the Russian Scientific Fund, ...
doi:10.1007/978-3-319-20119-1_3
fatcat:hgupli7firaqhebgiuojf35xwq
Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression
[article]
2017
arXiv
pre-print
We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. ...
The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs. ...
Acknowledgments We thank the NVIDIA Corporation for providing access to the P100 GPU used in this work. ...
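The compression idea, truncating a block's SVD to a prescribed rank, can be sketched as follows (illustrative NumPy only; the function name and data are assumptions, not the paper's batched API):

```python
import numpy as np

# Truncated-SVD compression of one matrix block: keep the k largest
# singular values and store the block as a rank-k factorization.
def truncate_block(block, k):
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]

# An exactly rank-2 example block, so truncation at k = 2 is lossless.
rng = np.random.default_rng(0)
block = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
A, B = truncate_block(block, 2)  # block is stored as A @ B
```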
arXiv:1707.05141v1
fatcat:2zteuq5hsje4xk6in2qhon5eju
MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing
2015
2015 IEEE High Performance Extreme Computing Conference (HPEC)
We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. ...
ACKNOWLEDGEMENTS This material is based upon work supported by the National Science Foundation under Grant ACI-1339822, the Department of Energy, and NVIDIA. ...
doi:10.1109/hpec.2015.7322444
dblp:conf/hpec/HaidarTLD15
fatcat:tvc75miwr5cjjh2ls4fwdet2fe
Linear algebra software for large-scale accelerated multicore computing
2016
Acta Numerica
Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value ...
decomposition (SVD) problems. ...
The solid curve shows the power consumption of our GPU implementation of the batched QR decomposition. ...
doi:10.1017/s0962492916000015
fatcat:cwsstweghjaj7ff6fu62lmn6ce
Investigation of the performance of LU decomposition method using CUDA
2012
Procedia Technology - Elsevier
In this study, a GPU-accelerated implementation of the LU linear algebra routine is presented. LU decomposition is a factorization of the form A = LU, where A is a square matrix. ...
The main idea of the LU decomposition is to record the steps used in Gaussian elimination on A in the places where the zeros are produced. L and U are lower and upper triangular matrices, respectively. ...
In one study, the fastest implementations of dense LU, QR, and Cholesky factorizations running on one or two NVIDIA GPUs are presented [3]. ...
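The "record the elimination steps" idea can be shown with a small unpivoted LU sketch (illustrative Python, not the CUDA implementation from the study):

```python
import numpy as np

# Doolittle LU without pivoting: each multiplier used to zero an entry
# below the diagonal is recorded in the corresponding position of L.
def lu_decompose(A):
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for j in range(n):
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]   # multiplier that zeroes U[i, j]
            U[i, :] -= L[i, j] * U[j, :]
    return L, U

A = np.array([[2.0, 1.0],
              [4.0, 5.0]])
L, U = lu_decompose(A)  # L unit lower triangular, U upper triangular
```

A production routine would add partial pivoting (PA = LU); that row-swapping step is exactly the routine that dominates the batched LU profiles cited elsewhere in these results.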
doi:10.1016/j.protcy.2012.02.011
fatcat:jaokyfnzlnbrngpgronwgie7zm
GPU-based N-1 Static Security Analysis Algorithm with Preconditioned Conjugate Gradient Method
2020
IEEE Access
Case studies on a practical 10828-bus system show that the GPU-based N-1 SSA algorithm with the batch-PCG solver is 4.90 times faster than a sequential algorithm on an 8-core CPU. ...
Second, it proposes a GPU-based batch-PCG solver, which packages a massive number of PCG subtasks into a large-scale problem to achieve a higher degree of parallelism and better coalesced memory accesses ...
As the rank-one update method solves single SLSE faster than LU decomposition, Solution 2 with multi-threaded Solver 2 is a superior SSA solution on the CPU platform. ...
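The per-subtask kernel that batch-PCG replicates is ordinary conjugate gradient; a minimal unpreconditioned sketch on one SPD system (illustrative data, not the paper's solver):

```python
import numpy as np

# Plain conjugate gradient on a symmetric positive definite system.
# The batch-PCG solver packages many such independent solves into one
# large GPU problem; here a single small system is solved for clarity.
def cg(A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
```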
doi:10.1109/access.2020.3004713
fatcat:k6wbjj6elngvfi2r376fudpibq
Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs
2016
IEEE Transactions on Parallel and Distributed Systems
Due to their high processing power, Graphics Processing Units have become an attractive target for this class of problems, and routines based on the LU and the QR factorization have been provided by NVIDIA ...
Due to the lack of a cuBLAS Cholesky factorization, execution rates of cuBLAS LU and cuBLAS QR are used for comparison against the proposed Cholesky factorization in this work. ...
ACKNOWLEDGMENTS This work is supported by grant #SHF-1320603: "Benchtesting Environment for Automated Software Tuning (BEAST)" from the National Science Foundation, by the Department of Energy grant #DE-SC0010042, and ...
doi:10.1109/tpds.2015.2481890
fatcat:uglchqysozci3f6hgey373tphu
LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU
2014
2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)
Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction ...
To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. ...
ACKNOWLEDGMENT The authors would like to thank the National Science Foundation, the Department of Energy, NVIDIA and MAGMA project support. ...
doi:10.1109/hpcc.2014.30
dblp:conf/hpcc/DongHLHTD14
fatcat:2eootml2bjdujlsx25jbiekgle
A Fast Batched Cholesky Factorization on a GPU
2014
2014 43rd International Conference on Parallel Processing
In this paper, we proposed a batched Cholesky factorization on a GPU. Three algorithms (nonblocked, blocked, and recursive blocked) were examined. ...
Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. ...
ACKNOWLEDGMENT The authors would like to thank the National Science Foundation, the Department of Energy, NVIDIA and MAGMA project support. ...
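The nonblocked variant, the simplest of the three examined, can be sketched per matrix as follows (CPU-side illustrative Python; a batched GPU kernel would apply this to many small matrices at once):

```python
import numpy as np

# Nonblocked (scalar) Cholesky factorization A = L L^T for a symmetric
# positive definite A, computed one column at a time.
def cholesky_nonblocked(A):
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])   # symmetric positive definite example
L = cholesky_nonblocked(A)
```

The blocked and recursive blocked variants reorganize the same arithmetic into matrix-matrix operations for better GPU throughput.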
doi:10.1109/icpp.2014.52
dblp:conf/icpp/DongHTD14
fatcat:gnbqn4mvp5aejetof56wb33yh4
CUDA Accelerated Visual Egomotion Estimation for Robotic Navigation
2017
Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Egomotion estimation is a fundamental issue in structure from motion and autonomous navigation for mobile robots. ...
Five-point methods, which use the minimal number of correspondences required to estimate the essential matrix, have raised special interest for their application in a hypothesize-and-test framework. ...
We exploit the batched interface of LU factorization, performing four GPU kernel calls for solving systems of the form MX = b as follows: 1. LU decomposition of M (PM = LU). 2. ...
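The batched pattern, many small independent systems factored and solved together, looks like this in NumPy, whose stacked LU-based `solve` stands in for cuBLAS's batched interface (illustrative shapes and random data):

```python
import numpy as np

# A batch of 16 independent 5x5 systems M_i x_i = b_i, solved in one
# call on the stacked arrays, mirroring the batched GPU interface.
rng = np.random.default_rng(1)
batch_M = rng.standard_normal((16, 5, 5))   # 16 small matrices
batch_b = rng.standard_normal((16, 5, 1))   # matching right-hand sides

batch_x = np.linalg.solve(batch_M, batch_b)  # one LU-based solve per system
```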
doi:10.5220/0006171501070114
dblp:conf/visapp/OuerghiBST17
fatcat:ugwfmecbrrcxtoyt4exjor73cm
CULA: hybrid GPU accelerated linear algebra routines
2010
Modeling and Simulation for Defense Systems and Applications V
We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. ...
The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. ...
eigenproblem solvers (general and symmetric), singular value decompositions, and many useful factorizations (QR, Hessenberg, etc.) ...
doi:10.1117/12.850538
fatcat:fx6e5zr6jbhxrdnvjyfim4eaie
Showing results 1 — 15 out of 150 results