Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors

2016
*
IEEE Transactions on Parallel and Distributed Systems
*

Sparse matrix-vector and matrix-transpose-vector multiplication (SpMM^TV) repeatedly performed as z ← A^T x and y ← A z (or y ← A w) for the same sparse matrix A is a kernel operation widely used in various ... Index Terms: cache locality, sparse matrix, sparse matrix-vector multiplication, matrix reordering, singly bordered block-diagonal form, Intel Many Integrated Core Architecture (Intel MIC), Intel Xeon Phi ... ACKNOWLEDGMENTS This work was partially supported by the PRACE 4IP project funded in part by Horizon 2020, the EU Framework Programme for Research and Innovation (2014-2020), under grant agreement number ...
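The two products named in this abstract, z ← A^T x followed by y ← A z, can be illustrated with a minimal sketch. It assumes the matrix is stored in compressed sparse row (CSR) form and uses plain Python lists; the function names `spmtv_csr` and `spmv_csr` and the small example matrix are hypothetical and are not taken from the paper, whose locality-aware reordering is not reproduced here.

```python
def spmtv_csr(row_ptr, col_idx, vals, x, n_cols):
    """Compute z = A^T x by scattering each row of A into z."""
    z = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        xi = x[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            z[col_idx[k]] += vals[k] * xi
    return z


def spmv_csr(row_ptr, col_idx, vals, z):
    """Compute y = A z by gathering entries of z for each row of A."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * z[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 2x3 example: A = [[1, 0, 2], [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]

x = [1.0, 1.0]
z = spmtv_csr(row_ptr, col_idx, vals, x, n_cols=3)  # z = A^T x
y = spmv_csr(row_ptr, col_idx, vals, z)             # y = A z
```

Both routines traverse the same CSR arrays, which is why the order in which the nonzeros of A are visited affects cache reuse across the two products.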

##
###
Accelerating an Iterative Eigensolver for Nuclear Structure Configuration Interaction Calculations on GPUs using OpenACC
[article]

2021
*
arXiv
*
pre-print

We compare the performance of the OpenACC-based implementation executed on multiple GPUs with the performance on distributed-memory many-core CPUs, and demonstrate significant speedup achieved on GPUs ... compared to the on-node performance of a many-core CPU. ... We would like to thank Mathew Colgrove and Brent Leback from NVIDIA, as well as Ron Caplan from Predictive Sciences Inc. for their valuable suggestions on porting the LOBPCG solver to OpenACC. ...

arXiv:2109.00485v1
fatcat:4f243aknszg43ljupfwxrqfxnu

##
###
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

2017
*
IEEE Transactions on Parallel and Distributed Systems
*

We consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. ... As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware ... An existing implementation of CSB for sparse matrix-vector (SpMV) and transpose sparse matrix-vector (SpMV^T) multiplication stores nonzeros within each block using a space filling curve to exploit data ...

doi:10.1109/tpds.2016.2630699
fatcat:6w4u7qyec5ehtjld4deuq45cwe

##
###
Architectural support for uniprocessor and multiprocessor active memory systems

2004
*
IEEE transactions on computers
*

We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on ... However, they create coherence problems since the processor is allowed to refer to the same data via more than one address. ... Finally, the SMVM microbenchmark carries out the sparse matrix vector multiplication kernel. ...

doi:10.1109/tc.2004.1261836
fatcat:2ewquko6ivexff5zifctdxlvie

##
###
Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid

2016
*
SIAM Journal on Scientific Computing
*

On large-scale distributed-memory parallel machines, the computation time of the setup phase is dominated by a sequence of sparse matrix-matrix multiplications (SpMMs) involving matrices distributed across ... The standard approach for performing each of these parallel sparse matrix multiplications is to use a row-wise algorithm: for general C = A · B, each processor owns a subset of the rows of A, a subset ... as a sparse matrix multiplication. ...
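The row-wise algorithm for C = A · B mentioned in this snippet can be sketched as follows. This is a minimal single-process illustration, assuming a dict-of-dicts sparse storage and that every needed row of B is already available locally; the function name `rowwise_spgemm`, the storage layout, and the example matrices are hypothetical, and the communication step that the paper targets is only hinted at by listing which rows of B the owned rows of A touch.

```python
def rowwise_spgemm(A_rows, B):
    """Row-wise sparse product: C[i, :] = sum over k of A[i, k] * B[k, :].

    A_rows: {i: {k: a_ik}} rows of A owned by this (hypothetical) processor.
    B:      {k: {j: b_kj}} rows of B, assumed to be available locally.
    """
    C = {}
    for i, a_row in A_rows.items():
        c_row = {}
        for k, a_ik in a_row.items():
            for j, b_kj in B.get(k, {}).items():
                c_row[j] = c_row.get(j, 0.0) + a_ik * b_kj
        C[i] = c_row
    return C


# Hypothetical example: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}

# Pretend this processor owns only row 0 of A.
owned = {0: A[0]}
needed_B_rows = sorted({k for row in owned.values() for k in row})
C_local = rowwise_spgemm(owned, B)
C_full = rowwise_spgemm(A, B)
```

In a distributed setting, the rows in `needed_B_rows` that live on other processors are the ones that would have to be communicated before the local product can run.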

##
###
Parallel Breadth-First Search on Distributed Memory Systems
[article]

2011
*
arXiv
*
pre-print

We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse ... matrix-partitioning-based approach that mitigates parallel communication overhead. ... Gilbert, Steve Reinhardt, and Adam Lugowski greatly improved our understanding of casting BFS iterations into sparse linear algebra. ...

arXiv:1104.4518v2
fatcat:a7nvtwil35dbtohgpnsfldeeki

##
###
Parallel breadth-first search on distributed memory systems

2011
*
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
*

We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse ... matrix partitioning-based approach that mitigates parallel communication overhead. ... A sparse graph can analogously be viewed as a sparse matrix, and optimization strategies for linear algebra computations similar to BFS, such as sparse matrix-vector multiplication [36], may be translated ...

doi:10.1145/2063384.2063471
dblp:conf/sc/BulucM11
fatcat:cn4tlzqd4ndqlhekngx76hjvhy

##
###
Variable-size batched Gauss–Jordan elimination for block-Jacobi preconditioning on graphics processors

2018
*
Parallel Computing
*

To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication ... cores. ... GJE for matrix inversion. GJE has been proposed in the last years as an efficient method for matrix inversion on clusters of multicore processors and many-core hardware accelerators [4, 5]. ...

doi:10.1016/j.parco.2017.12.006
fatcat:e5mmhrxsvbeuhlffm4m6k6nkki

##
###
Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
[chapter]

2014
*
Lecture Notes in Computer Science
*

), and Many Integrated Core (MIC) architectures. ... The common practice in parallel matrix-vector multiplication is to assign a fixed set of rows/columns to each processing unit for the A non-transpose/transpose cases, respectively. ... Left: SpMV kernel implementation for the SELL-P sparse matrix format for t = 8, see (10). ...

doi:10.1007/978-3-642-55224-3_53
fatcat:ujwldqeaxnduvakc6iymf5u3ai

##
###
Sparse Tensor Algebra as a Parallel Programming Model
[article]

2015
*
arXiv
*
pre-print

We show that sparse tensor algebra can also be used to express many of the transformations on these datasets, especially those which are parallelizable. ... Tensor computations are a natural generalization of matrix and graph computations. ... Sparse-matrix vector multiplication is a ubiquitous primitive for such methods. ...

arXiv:1512.00066v1
fatcat:ucwvhncgzrcytepo6jumohng4y

##
###
Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication

2011
*
2011 IEEE International Parallel & Distributed Processing Symposium
*

Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. ... Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. ... ACKNOWLEDGMENTS We thank Grey Ballard of UC Berkeley for his constructive comments on the paper, especially the algorithms analysis. ...

doi:10.1109/ipdps.2011.73
dblp:conf/ipps/BulucWOD11
fatcat:37gp3czbwzaqflrypenxnboyz4

##
###
FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs
[article]

2017
*
arXiv
*
pre-print

To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. ... We evaluate FlashR on a variety of machine learning and statistics algorithms on inputs of up to four billion data points. FlashR out-of-core tracks closely the performance of FlashR in-memory. ... FlashR focuses on optimizations in a single machine (with multiple CPUs and many cores) and scales matrix operations beyond memory capacity by utilizing solid-state drives (SSDs). ...

arXiv:1604.06414v4
fatcat:dsobnkm2tbbn5oe4h4orqzdzfa

##
###
Reducing Communication in Algebraic Multigrid with Multi-step Node Aware Communication
[article]

2019
*
arXiv
*
pre-print

Algebraic multigrid (AMG) is often viewed as a scalable O(n) solver for sparse linear systems. ... This work introduces a parallel implementation of AMG to reduce the cost of communication, yielding an increase in scalability. ... ACKNOWLEDGMENTS This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state ...

arXiv:1904.05838v2
fatcat:artym4ja6jf3nkuykmmatqmlum

##
###
Low-overhead load-balanced scheduling for sparse tensor computations

2014
*
2014 IEEE High Performance Extreme Computing Conference (HPEC)
*

We achieve around 4-5x improvement in performance over existing parallel approaches and observe "scalable" parallel performance on modern multicore systems with up to 32 processor cores. ... and data locality. ... , X → X_(n) • Matrix transpose • Inverse of a diagonal matrix • Pseudo-inverse of a matrix, V^† • Matrix matrix multiplication, C = AB • Matrix Khatri-Rao product, C = A ⊙ B • Matrix element-wise division ...

doi:10.1109/hpec.2014.7041006
dblp:conf/hpec/BaskaranML14
fatcat:g7srlfdtrvgltda4fiqrrsmwbi

##
###
Acceleration of GPU-based Krylov solvers via data transfer reduction

2015
*
The international journal of high performance computing applications
*

designing a graphics processing unit specific sparse matrix-vector product kernel that is able to more efficiently use the graphics processing unit's computing power. ... Considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as sparse matrix-vector, are crucial ... The common practice in parallel matrix-vector multiplication is to assign a fixed set of rows/columns to each processing unit for the A non-transpose/transpose cases, respectively. ...

doi:10.1177/1094342015580139
fatcat:nu6cfe5puzes3cchq4rvj7teju
