Data Layout Optimizations for Variable Coefficient Multigrid
[chapter]
2002
Lecture Notes in Computer Science
We focus on data layout techniques to enhance the cache efficiency of multigrid codes for variable coefficient problems on regular meshes. ...
This paper is based on our previous work on data access transformations for multigrid methods for constant coefficient problems. ...
Semantics-maintaining cache optimization techniques for constant coefficient problems on structured grids have been studied extensively in our DiME project [9, 16]. ...
doi:10.1007/3-540-47789-6_67
fatcat:x4rw6iro6zfpzfcda33ltxsn3a
Line size adaptivity analysis of parameterized loop nests for direct mapped data cache
2005
IEEE transactions on computers
We present an approach that enables the quantification of data misses with respect to cache-line size at compile-time using (parametric) equations, which model interference. ...
We examine efficient utilization of data caches in an adaptive memory hierarchy. We exploit data reuse through the static analysis of cache-line size adaptivity. ...
They helped on Ehrhart polynomials and the existence test, cache miss equation determination, interference estimation, and moral/technical support, respectively. ...
doi:10.1109/tc.2005.28
fatcat:2fxscz5iizhdlisriltjbql5yu
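The paper above quantifies misses analytically with parametric equations; for contrast, the same quantity can be measured by brute force. A minimal sketch (hypothetical trace; not the paper's compile-time method) that counts misses of a byte-address trace in a direct-mapped cache for a given line size:

```python
def count_misses(addresses, line_size, num_lines):
    """Count misses of a byte-address trace in a direct-mapped cache."""
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // line_size       # which memory block the byte lives in
        index = block % num_lines       # direct-mapped: block determines its line
        if tags[index] != block:        # cold or conflict miss
            tags[index] = block
            misses += 1
    return misses

# Sequential walk over 256 bytes: one miss per 32-byte line -> 8 misses.
trace = list(range(256))
print(count_misses(trace, line_size=32, num_lines=64))
```

Sweeping `line_size` over such a loop's address trace reproduces, point by point, the miss-versus-line-size curve that the paper's equations describe in closed form.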
An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations
[article]
2021
arXiv pre-print
The sparse matrix-vector multiplication is one of the most time-consuming operations on unstructured simulations. ...
The cache is implemented as a circular list that maintains the BRAM vector components while needed. ...
The cache vector's optimal size is calculated from the width of the band obtained from the Cuthill-McKee reordering. ...
arXiv:2107.12371v1
fatcat:qrtzyewdo5apxni6n2fe2hptem
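The sparse matrix-vector product underlying this work can be sketched in plain CSR form (a minimal Python sketch for reference, not the FPGA cache design the paper describes):

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in Compressed Sparse Row form."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example: [[4, 1, 0], [0, 3, 2], [1, 0, 5]]
values  = np.array([4.0, 1.0, 3.0, 2.0, 1.0, 5.0])
col_idx = np.array([0, 1, 1, 2, 0, 2])
row_ptr = np.array([0, 2, 4, 6])
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(values, col_idx, row_ptr, x))
```

The irregular `x[col_idx[k]]` gathers are what make SpMV cache-hostile; a banded matrix (as produced by Cuthill-McKee reordering) keeps those gathers within a window of `x` whose width is the bandwidth, which is what the paper's circular-list cache exploits.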
The Matrix Template Library: A Generic Programming Approach to High Performance Numerical Linear Algebra
[chapter]
1998
Lecture Notes in Computer Science
We also tackle the performance portability problem for particular architecture dependent algorithms such as matrix-matrix multiply. ...
We present a unified approach for expressing high performance numerical linear algebra routines for large classes of dense and sparse matrices. ...
The authors would like to express their appreciation to Tony Skjellum and Puri Bangalore for numerous helpful discussions. ...
doi:10.1007/3-540-49372-7_6
fatcat:moqousba7vc2nicuohplcbv23y
Heterogeneous Sparse Matrix-Vector Multiplication via Compressed Sparse Row Format
[article]
2022
arXiv pre-print
Due to ill performance on many devices, sparse matrix-vector multiplication (SpMV) normally requires special care to store and tune for a given device. ...
Due to its simplicity, a model can be tuned for a device, and this model can be used to select super-row and super-super-row sizes in constant time. ...
CSR-k is compiled on system 3 with the AMD Optimizing C Compiler (AOCC) v2.3.0 with -O3 optimization and -march=znver2. ...
arXiv:2203.05096v2
fatcat:dlwb447tsrfzxogvlyygx5s2zy
ATLAS Version 3.9: Overview and Status
[chapter]
2010
Software Automatic Tuning
ATLAS produces a full BLAS (Basic Linear Algebra Subprograms) library as well as providing some optimized routines for LAPACK (Linear Algebra PACKage). ...
ATLAS is an instantiation of a paradigm in high performance library production and maintenance, which we term AEOS (Automated Empirical Optimization of Software); this style of library management has been ...
We plan on providing a series of timers that can be used to empirically find good parameters for ILAENV for the more important LAPACK routines across a range of problem sizes. ...
doi:10.1007/978-1-4419-6935-4_2
fatcat:6u5m2itzmffvvcpuvhtfxmc47e
Sparsity: Optimization Framework for Sparse Matrix Kernels
2004
The international journal of high performance computing applications
Cache level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the nonzero structure is random. ...
Our experience indicates that register level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. ...
Acknowledgement We would like to thank Osni Marques for providing us a web document matrix, Tuyet-Linh Phan for her help with the data collection, and Jim Demmel for discussions on algorithms that use ...
doi:10.1177/1094342004041296
fatcat:hjien6e4hjg5vnrrqyasjhlyde
NUMA Aware Iterative Stencil Computations on Many-Core Systems
2012
2012 IEEE 26th International Parallel and Distributed Processing Symposium
into many independent tasks, and data-to-core affinity for NUMA-aware data distribution. ...
Results are presented on an 8-socket dual-core system and a 4-socket oct-core system and compared against an optimized naive scheme, various peak performance characteristics, and related schemes from the literature ...
ACKNOWLEDGMENT We would like to thank Yuan Tang and the Pochoir stencil compiler team for granting us an early access to their code for testing. ...
doi:10.1109/ipdps.2012.50
dblp:conf/ipps/ShaheenS12
fatcat:cou7th27k5dh5fhtnq5b5bip4i
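The kind of iterative stencil computation this paper parallelizes can be sketched as a serial Jacobi sweep of the 2-D 5-point stencil (a minimal reference sketch, not the paper's NUMA-aware tasking scheme):

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi sweep of the 2-D 5-point stencil over interior points."""
    v = u.copy()
    # Each interior point becomes the average of its four neighbors.
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((6, 6))
u[0, :] = 1.0          # fixed boundary condition on the top edge
for _ in range(50):    # iterate toward the discrete Laplace solution
    u = jacobi_step(u)
```

Because each sweep touches the whole grid with low arithmetic intensity, performance is dominated by where the grid rows are placed in memory, which is why data-to-core affinity matters on NUMA systems.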
Improving Image Processing Systems by Using Software Simulated LRU Cache Algorithms
2012
Informatică economică
A solution had to be devised for a problem that appears easy at first but is complex to implement. ...
We can adjust this concept in software programming by identifying the problem and coming up with an implementation. ...
be used for a one-band tile; b) calculate the maximum tile width using the maximumArea; c) calculate the maximum tile height using the maximumArea; d) calculate the actualCacheUnitSize. Each cache unit ...
doaj:ed1f764a227344298aeaabafc9fc30fb
fatcat:23re3iohgvfmvblhpcxfhgutri
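A software-simulated LRU cache of the kind discussed above can be sketched with an ordered dictionary (a minimal generic sketch with a hypothetical `capacity` parameter, not the paper's tile-based image cache):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache: evicts the oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key, default=None):
        if key not in self.store:
            return default
        self.store.move_to_end(key)         # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now most recently used
cache.put("c", 3)    # capacity exceeded: evicts "b", the LRU entry
```

In an image-processing setting the keys would be tile coordinates and the values decoded tile buffers, so revisited tiles are served from memory instead of being re-read and re-decoded.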
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
[article]
2018
arXiv pre-print
DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated ...
Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations ...
Acknowledgements We are grateful for the numerous discussions and fruitful ongoing collaboration with the following people: Tianqi Chen, Moustapha Cissé, Cijo Jose, Chandan Reddy, Will Feng, Edward Yang ...
arXiv:1802.04730v3
fatcat:2ef5ete4mvao5bz43h7z7dtlwi
Performance Analysis of Effective Symbolic Methods for Solving Band Matrix SLAEs
2019
EPJ Web of Conferences
This paper presents an experimental performance study of implementations of three symbolic algorithms for solving band matrix systems of linear algebraic equations with heptadiagonal, pentadiagonal, and ...
The only assumption on the coefficient matrix in order for the algorithms to be stable is nonsingularity. ...
However, in the context of symbolic computations for solving a SLAE with a band coefficient matrix (length of band equal to 3, 5 or 7) among the options suggested in this work, we can note the following ...
doi:10.1051/epjconf/201921405004
fatcat:itleods7wbadvdf56aemzxa3su
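For the band length 3 case mentioned above, the classic numeric counterpart of such solvers is the Thomas algorithm for tridiagonal systems (a minimal floating-point sketch, not the symbolic algorithms the paper benchmarks):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal.

    a[0] and c[-1] are unused. Assumes the matrix is nonsingular and the
    forward elimination encounters no zero pivots.
    """
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                    # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# [[2,1,0],[1,3,1],[0,1,2]] @ x = [3,5,3]  has solution  x = [1,1,1]
print(thomas([0.0, 1.0, 1.0], [2.0, 3.0, 2.0], [1.0, 1.0, 0.0], [3.0, 5.0, 3.0]))
```

The symbolic variants studied in the paper follow the same elimination/back-substitution structure but carry exact expressions instead of floats, trading speed for freedom from rounding error; pentadiagonal and heptadiagonal bands generalize the recurrence to two and three off-diagonals.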
DXML: A High-performance Scientific Subroutine Library
1994
Digital technical journal of Digital Equipment Corporation
We would also like to thank Roger Grimes at Boeing Computer Services for making the Harwell-Boeing matrices so readily available. ...
BLAS library. [10] LAPACK can be used for solving many common linear algebra problems, including solution of linear systems, linear least-squares problems, eigenvalue problems, and singular value problems ...
Both versions are written in standard Fortran and compiled using identical compiler options. Optimization of BLAS 1: BLAS 1 routines operate on vector and scalar data only. ...
dblp:journals/dtj/KamathHM94
fatcat:ibyqjwgy3vesrg2sl3nzxavblq
Performance of an Astrophysical Radiation Hydrodynamics Code under Scalable Vector Extension Optimization
[article]
2022
arXiv pre-print
We explored several compilers and performance analysis packages and found the code did not perform as expected under scalable vector extension optimization, suggesting that a "deeper dive" into analyzing ...
The code solves sparse linear systems, a task for which the A64FX architecture should be well suited. ...
ACKNOWLEDGMENT The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University for access to the innovative ...
arXiv:2207.13251v1
fatcat:vdx3oj7ipzg2nnzcjwil5verz4
Symmetric Indefinite Linear Solver Using OpenMP Task on Multicore Architectures
2018
IEEE Transactions on Parallel and Distributed Systems
We also thank the Intel Corporation for their generous hardware donation and continuous financial support and the Oak Ridge Leadership Computing Facility for providing access to the ARMv8 and POWER8 systems ...
ACKNOWLEDGMENTS The authors would like to thank the members of the PLASMA project for the valuable discussions. ...
Such linear solvers are also needed for unconstrained or constrained optimization problems or for solving the augmented system for general least squares discretized-incompressible Navier-Stokes equations ...
doi:10.1109/tpds.2018.2808964
fatcat:yomrv5hnonfcjhzkmcneoqk6gq
Layout-oblivious compiler optimization for matrix computations
2013
ACM Transactions on Architecture and Code Optimization (TACO)
of compiler optimizations. ...
to be much more accurately analyzed and optimized through varying state-of-the-art compiler technologies. ...
Our approach solves this problem and enables alternative implementations of the same matrix computation to benefit from a common set of optimizations. ...
doi:10.1145/2400682.2400694
fatcat:24fhy46qvbgc7g3webltf4b5hi
Showing results 1 — 15 out of 2,932 results