774 Hits in 5.4 sec

Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors

M. Ozan Karsavuran, Kadir Akbudak, Cevdet Aykanat
2016 IEEE Transactions on Parallel and Distributed Systems  
Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMᵀV), repeatedly performed as z ← Aᵀx and y ← Az (or y ← Aw) for the same sparse matrix A, is a kernel operation widely used in various ... Index Terms: Cache locality, sparse matrix, sparse matrix-vector multiplication, matrix reordering, singly bordered block-diagonal form, Intel Many Integrated Core Architecture (Intel MIC), Intel Xeon Phi ... ACKNOWLEDGMENTS This work was partially supported by the PRACE 4IP project, funded in part by Horizon 2020, the EU Framework Programme for Research and Innovation (2014-2020), under grant agreement number ...
doi:10.1109/tpds.2015.2453970 fatcat:bnwlk426mrbbldag3cxrq2bkyy
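The abstract describes a kernel that applies both A and Aᵀ to vectors for the same sparse matrix. A minimal sketch of that pairing, assuming a standard CSR layout (`indptr`/`indices`/`data`), is a single fused sweep over the nonzeros; the paper's actual reordering and locality scheme is not shown here, and all names are illustrative.

```python
# Hedged sketch: one pass over a CSR matrix computing both y = A z and
# w = A^T x, the kernel pair the abstract calls SpMMTV. Fusing the two
# products lets each nonzero be read once for both results.

def spmmtv(n_rows, n_cols, indptr, indices, data, z, x):
    """Compute y = A z and w = A^T x in a single sweep of the nonzeros."""
    y = [0.0] * n_rows
    w = [0.0] * n_cols
    for i in range(n_rows):
        xi = x[i]
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            a = data[k]
            acc += a * z[j]   # row contribution to y = A z
            w[j] += a * xi    # column contribution to w = A^T x
        y[i] = acc
    return y, w
```

For A = [[1, 2], [0, 3]] in CSR form, `spmmtv(2, 2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], z, x)` yields both products from one traversal of the three nonzeros.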

Accelerating an Iterative Eigensolver for Nuclear Structure Configuration Interaction Calculations on GPUs using OpenACC [article]

Pieter Maris, Chao Yang, Dossay Oryspayev, Brandon Cook
2021 arXiv   pre-print
We compare the performance of the OpenACC based implementation executed on multiple GPUs with the performance on distributed-memory many-core CPUs, and demonstrate significant speedup achieved on GPUs  ...  compared to the on-node performance of a many-core CPU.  ...  We would like to thank Mathew Colgrove and Brent Leback from NVIDIA, as well as Ron Caplan from Predictive Sciences Inc. for their valuable suggestions on porting the LOBPCG solver to OpenACC.  ... 
arXiv:2109.00485v1 fatcat:4f243aknszg43ljupfwxrqfxnu

A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydin Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary
2017 IEEE Transactions on Parallel and Distributed Systems  
We consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. ... As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware ... An existing implementation of CSB for sparse matrix-vector (SpMV) and transpose sparse matrix-vector (SpMVᵀ) multiplication stores nonzeros within each block using a space-filling curve to exploit data ...
doi:10.1109/tpds.2016.2630699 fatcat:6w4u7qyec5ehtjld4deuq45cwe
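The SpMM kernel named in this abstract multiplies a sparse matrix by a block of dense vectors, so each nonzero fetched from memory is reused across the whole block. A pure-Python sketch under a CSR assumption (function and argument names are illustrative, not from the paper):

```python
# Hedged sketch of SpMM: multiply a CSR sparse matrix by a block of m
# dense vectors at once, amortizing each nonzero access across the block,
# which is the bandwidth argument the abstract makes for block solvers.

def spmm(indptr, indices, data, X):
    """Y = A @ X for CSR A and dense X given as a list of rows."""
    m = len(X[0])
    Y = []
    for i in range(len(indptr) - 1):
        row = [0.0] * m
        for k in range(indptr[i], indptr[i + 1]):
            j, a = indices[k], data[k]
            xj = X[j]
            for c in range(m):
                row[c] += a * xj[c]  # one nonzero serves all m vectors
        Y.append(row)
    return Y
```

Multiplying by the 2x2 identity block recovers the dense form of A, which is a convenient sanity check.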

Architectural support for uniprocessor and multiprocessor active memory systems

Daehyun Kim, M. Chaudhuri, M. Heinrich, E. Speight
2004 IEEE transactions on computers  
We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on  ...  However, they create coherence problems since the processor is allowed to refer to the same data via more than one address.  ...  Finally, the SMVM microbenchmark carries out the sparse matrix vector multiplication kernel.  ... 
doi:10.1109/tc.2004.1261836 fatcat:2ewquko6ivexff5zifctdxlvie

Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid

Grey Ballard, Christopher Siefert, Jonathan Hu
2016 SIAM Journal on Scientific Computing  
On large-scale distributed-memory parallel machines, the computation time of the setup phase is dominated by a sequence of sparse matrix-matrix multiplication (SpMMs) involving matrices distributed across  ...  The standard approach for performing each of these parallel sparse matrix multiplications is to use a row-wise algorithm: for general C = A · B, each processor owns a subset of the rows of A, a subset  ...  as a sparse matrix multiplication.  ... 
doi:10.1137/15m1028807 fatcat:idsejyuelnbvrjndf7lv3geeda
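The abstract describes the standard row-wise algorithm for C = A · B: row i of C is the sum of rows B[j, :] scaled by the nonzeros A[i, j]. A sequential, dict-of-sparse-rows sketch of that idea (the distributed-memory ownership and communication the paper studies are not modeled; all names are illustrative):

```python
# Hedged sketch of the row-wise sparse matrix multiplication the abstract
# describes (Gustavson-style): accumulate scaled rows of B into each row
# of C. Matrices are lists of sparse rows stored as {col: val} dicts.

def rowwise_spgemm(A_rows, B_rows):
    C_rows = []
    for a_row in A_rows:
        acc = {}
        for j, a in a_row.items():          # for each nonzero A[i,j] ...
            for k, b in B_rows[j].items():  # ... accumulate a * B[j,:]
                acc[k] = acc.get(k, 0.0) + a * b
        C_rows.append(acc)
    return C_rows
```

In a distributed setting each processor would run this loop over its owned rows of A after fetching the needed rows of B, which is exactly the communication the paper aims to reduce.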

Parallel Breadth-First Search on Distributed Memory Systems [article]

Aydin Buluc, Kamesh Madduri
2011 arXiv   pre-print
We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. ... Gilbert, Steve Reinhardt, and Adam Lugowski greatly improved our understanding of casting BFS iterations into sparse linear algebra. ...
arXiv:1104.4518v2 fatcat:a7nvtwil35dbtohgpnsfldeeki

Parallel breadth-first search on distributed memory systems

Aydin Buluç, Kamesh Madduri
2011 Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11  
We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. ... A sparse graph can analogously be viewed as a sparse matrix, and optimization strategies for linear algebra computations similar to BFS, such as sparse matrix-vector multiplication [36], may be translated ...
doi:10.1145/2063384.2063471 dblp:conf/sc/BulucM11 fatcat:cn4tlzqd4ndqlhekngx76hjvhy
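The "BFS as sparse linear algebra" view mentioned in both BFS abstracts treats one level expansion as a sparse matrix-vector product of the adjacency matrix with the frontier over a boolean semiring. A small sketch of that correspondence, with the graph as a dict-of-sets (the representation and names are illustrative, not the papers' data structures):

```python
# Hedged sketch of level-synchronous BFS: each iteration expands the
# frontier through the adjacency structure, which corresponds to a sparse
# matrix-vector product over the (OR, AND) semiring in the linear-algebra
# formulation the abstracts cite.

def bfs_levels(adj, source):
    """Return a dict vertex -> BFS level via frontier sweeps."""
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        nxt = set()
        for u in frontier:              # the 'SpMV' over the frontier
            for v in adj.get(u, ()):
                if v not in level:      # masking out visited vertices
                    level[v] = depth
                    nxt.add(v)
        frontier = nxt
    return level
```

The two-dimensional partitioning variant in the papers distributes exactly this product over a processor grid to cut communication volume.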

Variable-size batched Gauss–Jordan elimination for block-Jacobi preconditioning on graphics processors

Hartwig Anzt, Jack Dongarra, Goran Flegar, Enrique S. Quintana-Ortí
2018 Parallel Computing  
To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication ... cores. ... GJE for matrix inversion: GJE has been proposed in recent years as an efficient method for matrix inversion on clusters of multicore processors and many-core hardware accelerators [4, 5]. ...
doi:10.1016/j.parco.2017.12.006 fatcat:e5mmhrxsvbeuhlffm4m6k6nkki
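The block-Jacobi preconditioner in this paper needs the inverses of many small, variable-size diagonal blocks, computed by Gauss-Jordan elimination. A minimal sequential sketch of GJE inversion looped over a batch (with partial pivoting only; the paper's GPU batching and thread mapping are not modeled, and all names are illustrative):

```python
# Hedged sketch of Gauss-Jordan elimination (GJE) for inverting the small
# diagonal blocks of a block-Jacobi preconditioner, applied to a batch of
# variable-size blocks as in the abstract.

def gje_invert(M):
    """Invert a small dense matrix (list of lists) via GJE on [M | I]."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(A[r][col]))  # partial pivot
        A[col], A[p] = A[p], A[col]
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]       # right half of [I | M^-1]

def batched_invert(blocks):
    """The 'variable-size batched' loop: blocks may differ in dimension."""
    return [gje_invert(B) for B in blocks]
```

GJE is attractive here because, unlike LU-based inversion, it has no separate triangular-solve phase, which maps well onto batched GPU kernels.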

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi [chapter]

Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, Stanimire Tomov
2014 Lecture Notes in Computer Science  
... and Many Integrated Core (MIC) architectures. ... The common practice in parallel matrix-vector multiplication is to assign a fixed set of rows/columns to each processing unit for the A non-transpose/transpose cases, respectively. ... Left: SpMV kernel implementation for the SELL-P sparse matrix format for t = 8; see (10). ...
doi:10.1007/978-3-642-55224-3_53 fatcat:ujwldqeaxnduvakc6iymf5u3ai

Sparse Tensor Algebra as a Parallel Programming Model [article]

Edgar Solomonik, Torsten Hoefler
2015 arXiv   pre-print
We show that sparse tensor algebra can also be used to express many of the transformations on these datasets, especially those which are parallelizable.  ...  Tensor computations are a natural generalization of matrix and graph computations.  ...  Sparse-matrix vector multiplication is a ubiquitous primitive for such methods.  ... 
arXiv:1512.00066v1 fatcat:ucwvhncgzrcytepo6jumohng4y

Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication

Aydin Buluç, Samuel Williams, Leonid Oliker, James Demmel
2011 2011 IEEE International Parallel & Distributed Processing Symposium  
Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods.  ...  Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability.  ...  ACKNOWLEDGMENTS We thank Grey Ballard of UC Berkeley for his constructive comments on the paper, especially the algorithms analysis.  ... 
doi:10.1109/ipdps.2011.73 dblp:conf/ipps/BulucWOD11 fatcat:37gp3czbwzaqflrypenxnboyz4
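This abstract notes that iterative methods multiply by both the matrix and (in the symmetric case implicitly) its transpose. One classic bandwidth-reducing trick in this setting is to store only one triangle of a symmetric matrix and let each stored nonzero contribute to two output entries. A sketch of that idea under a CSR lower-triangle assumption (this is a generic illustration, not the paper's specific compressed format):

```python
# Hedged sketch of symmetric SpMV with only the lower triangle stored:
# each nonzero A[i][j] (i >= j) updates both y[i] and y[j], so roughly
# half the matrix data is read compared to storing A in full.

def sym_spmv(indptr, indices, data, x):
    """y = A x for symmetric A stored as its lower triangle in CSR."""
    n = len(indptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j, a = indices[k], data[k]
            y[i] += a * x[j]
            if i != j:           # mirrored upper-triangle contribution
                y[j] += a * x[i]
    return y
```

The scattered updates to y[j] are what make this transformation hard to parallelize naively, which is the scalability problem the paper addresses.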

FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs [article]

Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, Randal Burns
2017 arXiv   pre-print
To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. ... We evaluate FlashR on a variety of machine learning and statistics algorithms on inputs of up to four billion data points. FlashR out-of-core closely tracks the performance of FlashR in-memory. ... FlashR focuses on optimizations in a single machine (with multiple CPUs and many cores) and scales matrix operations beyond memory capacity by utilizing solid-state drives (SSDs). ...
arXiv:1604.06414v4 fatcat:dsobnkm2tbbn5oe4h4orqzdzfa

Reducing Communication in Algebraic Multigrid with Multi-step Node Aware Communication [article]

Amanda Bienz, Luke Olson, William Gropp
2019 arXiv   pre-print
Algebraic multigrid (AMG) is often viewed as a scalable O(n) solver for sparse linear systems.  ...  This work introduces a parallel implementation of AMG to reduce the cost of communication, yielding an increase in scalability.  ...  ACKNOWLEDGMENTS This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state  ... 
arXiv:1904.05838v2 fatcat:artym4ja6jf3nkuykmmatqmlum

Low-overhead load-balanced scheduling for sparse tensor computations

Muthu Baskaran, Benoit Meister, Richard Lethin
2014 2014 IEEE High Performance Extreme Computing Conference (HPEC)  
We achieve around 4-5x improvement in performance over existing parallel approaches and observe "scalable" parallel performance on modern multicore systems with up to 32 processor cores. ... and data locality. ... • Matricization of a tensor, X → X_(n) • Matrix transpose • Inverse of a diagonal matrix • Pseudo-inverse of a matrix, V† • Matrix-matrix multiplication, C = AB • Matrix Khatri-Rao product, C = A ⊙ B • Matrix element-wise division ...
doi:10.1109/hpec.2014.7041006 dblp:conf/hpec/BaskaranML14 fatcat:g7srlfdtrvgltda4fiqrrsmwbi
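Two of the kernels listed in this abstract can be sketched compactly: mode-n matricization X → X_(n) of a sparse tensor in coordinate form, and the Khatri-Rao product C = A ⊙ B. A minimal, illustrative version (the linearization order and all names are assumptions, not the paper's implementation):

```python
# Hedged sketch of two tensor kernels from the abstract's list:
# mode-n matricization of a sparse COO tensor, and the Khatri-Rao
# (column-wise Kronecker) product of two dense matrices.

def matricize(coords_vals, shape, n):
    """Map entries ((i0, ..., id-1), v) to X_(n) entries ((i_n, col), v)."""
    other = [d for d in range(len(shape)) if d != n]
    out = {}
    for idx, v in coords_vals:
        col, stride = 0, 1
        for d in reversed(other):   # linearize the remaining modes
            col += idx[d] * stride
            stride *= shape[d]
        out[(idx[n], col)] = out.get((idx[n], col), 0.0) + v
    return out

def khatri_rao(A, B):
    """C = A (.) B: pairwise row products, column by column."""
    cols = len(A[0])
    return [[a_row[c] * b_row[c] for c in range(cols)]
            for a_row in A for b_row in B]
```

Matricization followed by a Khatri-Rao product is the core of the MTTKRP step in CP tensor decompositions, which is the workload such schedulers target.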

Acceleration of GPU-based Krylov solvers via data transfer reduction

Hartwig Anzt, Stanimire Tomov, Piotr Luszczek, William Sawyer, Jack Dongarra
2015 The international journal of high performance computing applications  
designing a graphics processing unit specific sparse matrix-vector product kernel that is able to more efficiently use the graphics processing unit's computing power.  ...  Considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as sparse matrix-vector, are crucial  ...  The common practice in parallel matrix-vector multiplication is to assign a fixed set of rows/columns to each processing unit for the A non-transpose/transpose cases, respectively.  ... 
doi:10.1177/1094342015580139 fatcat:nu6cfe5puzes3cchq4rvj7teju