Data Layout Optimizations for Variable Coefficient Multigrid [chapter]

Markus Kowarschik, Ulrich Rüde, Christian Weiß
2002 Lecture Notes in Computer Science  
We focus on data layout techniques to enhance the cache efficiency of multigrid codes for variable coefficient problems on regular meshes.  ...  This paper is based on our previous work on data access transformations for multigrid methods for constant coefficient problems.  ...  Semantics-maintaining cache optimization techniques for constant coefficient problems on structured grids have been studied extensively in our DiME project [9, 16].  ... 
doi:10.1007/3-540-47789-6_67 fatcat:x4rw6iro6zfpzfcda33ltxsn3a

Line size adaptivity analysis of parameterized loop nests for direct mapped data cache

P. D'Alberto, A. Nicolau, A. Veidenbaum, Rajesh Gupta
2005 IEEE transactions on computers  
We present an approach that enables the quantification of data misses with respect to cache-line size at compile-time using (parametric) equations, which model interference.  ...  We examine efficient utilization of data caches in an adaptive memory hierarchy. We exploit data reuse through the static analysis of cache-line size adaptivity.  ...  They helped on Ehrhart polynomials and the existence test, cache miss equation determination, interference estimation, and moral/technical support, respectively.  ... 
doi:10.1109/tc.2005.28 fatcat:2fxscz5iizhdlisriltjbql5yu

An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations [article]

Guillermo Oyarzun, Daniel Peyrolon, Carlos Alvarez, Xavier Martorell
2021 arXiv   pre-print
The sparse matrix-vector multiplication is one of the most time-consuming operations on unstructured simulations.  ...  The cache is implemented as a circular list that maintains the BRAM vector components while needed.  ...  The cache vector's optimal size is calculated from the width of the band obtained from the Cuthill-McKee reordering.  ... 
arXiv:2107.12371v1 fatcat:qrtzyewdo5apxni6n2fe2hptem

The Matrix Template Library: A Generic Programming Approach to High Performance Numerical Linear Algebra [chapter]

Jeremy G. Siek, Andrew Lumsdaine
1998 Lecture Notes in Computer Science  
We also tackle the performance portability problem for particular architecture dependent algorithms such as matrix-matrix multiply.  ...  We present a unified approach for expressing high performance numerical linear algebra routines for large classes of dense and sparse matrices.  ...  The authors would like to express their appreciation to Tony Skjellum and Puri Bangalore for numerous helpful discussions.  ... 
doi:10.1007/3-540-49372-7_6 fatcat:moqousba7vc2nicuohplcbv23y

Heterogeneous Sparse Matrix-Vector Multiplication via Compressed Sparse Row Format [article]

Phillip Allen Lane, Joshua Dennis Booth
2022 arXiv   pre-print
Due to ill performance on many devices, sparse matrix-vector multiplication (SpMV) normally requires special care to store and tune for a given device.  ...  Due to its simplicity, a model can be tuned for a device, and this model can be used to select super-row and super-super-rows sizes in constant time.  ...  CSR-k is compiled on system 3 with the AMD Optimizing C Compiler (AOCC) v2.3.0 with -O3 optimization and -march=znver2.  ... 
arXiv:2203.05096v2 fatcat:dlwb447tsrfzxogvlyygx5s2zy

ATLAS Version 3.9: Overview and Status [chapter]

R. Clint Whaley
2010 Software Automatic Tuning  
ATLAS produces a full BLAS (Basic Linear Algebra Subprograms) library as well as providing some optimized routines for LAPACK (Linear Algebra PACKage).  ...  ATLAS is an instantiation of a paradigm in high performance library production and maintenance, which we term AEOS (Automated Empirical Optimization of Software); this style of library management has been  ...  We plan on providing a series of timers that can be used to empirically find good parameters for ILAENV for the more important LAPACK routines across a range of problem sizes.  ... 
doi:10.1007/978-1-4419-6935-4_2 fatcat:6u5m2itzmffvvcpuvhtfxmc47e

Sparsity: Optimization Framework for Sparse Matrix Kernels

Eun-Jin Im, Katherine Yelick, Richard Vuduc
2004 The international journal of high performance computing applications  
Cache level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the nonzero structure is random.  ...  Our experience indicates that register level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems.  ...  Acknowledgement We would like to thank Osni Marques for providing us a web document matrix, Tuyet-Linh Phan for her help with the data collection, and Jim Demmel for discussions on algorithms that use  ... 
doi:10.1177/1094342004041296 fatcat:hjien6e4hjg5vnrrqyasjhlyde

NUMA Aware Iterative Stencil Computations on Many-Core Systems

Mohammed Shaheen, Robert Strzodka
2012 2012 IEEE 26th International Parallel and Distributed Processing Symposium  
into many independent tasks, and data-to-core affinity for NUMA-aware data distribution.  ...  Results are presented on an 8 socket dual-core and a 4 socket oct-core systems and compared against an optimized naive scheme, various peak performance characteristics, and related schemes from literature  ...  ACKNOWLEDGMENT We would like to thank Yuan Tang and the Pochoir stencil compiler team for granting us an early access to their code for testing.  ... 
doi:10.1109/ipdps.2012.50 dblp:conf/ipps/ShaheenS12 fatcat:cou7th27k5dh5fhtnq5b5bip4i

Improving Image Processing Systems by Using Software Simulated LRU Cache Algorithms

Cosmin CIORANU, Marius CIOCA, Lucian-Ionel CIOCA
2012 Informatică economică  
A solution needed to be devised to overcome this easy problem at first, but complex in implementation.  ...  We can adjust this concept in software programming by identifying the problem and coming up with an implementation.  ...  be used for a one band tile: * 2 b) Calculate maximum tile width using the maximumArea c) Calculate maximum tile height using the maximumArea d) Calculating actualCacheUnitSize * * * 3 Each cache unit  ... 
doaj:ed1f764a227344298aeaabafc9fc30fb fatcat:23re3iohgvfmvblhpcxfhgutri

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions [article]

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen
2018 arXiv   pre-print
DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated  ...  Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations  ...  Acknowledgements We are grateful for the numerous discussions and fruitful ongoing collaboration with the following people: Tianqi Chen, Moustapha Cissé, Cijo Jose, Chandan Reddy, Will Feng, Edward Yang  ... 
arXiv:1802.04730v3 fatcat:2ef5ete4mvao5bz43h7z7dtlwi

Performance Analysis of Effective Symbolic Methods for Solving Band Matrix SLAEs

Milena Veneva, Alexander Ayriyan, A. Forti, L. Betev, M. Litmaath, O. Smirnova, P. Hristov
2019 EPJ Web of Conferences  
This paper presents an experimental performance study of implementations of three symbolic algorithms for solving band matrix systems of linear algebraic equations with heptadiagonal, pentadiagonal, and  ...  The only assumption on the coefficient matrix in order for the algorithms to be stable is nonsingularity.  ...  However, in the context of symbolic computations for solving a SLAE with a band coefficient matrix (length of band equal to 3, 5 or 7) among the options suggested in this work, we can note the following  ... 
doi:10.1051/epjconf/201921405004 fatcat:itleods7wbadvdf56aemzxa3su

DXML: A High-performance Scientific Subroutine Library

Chandrika Kamath, Roy Ho, Dwight P. Manley
1994 Digital technical journal of Digital Equipment Corporation  
We would also like to thank Roger Grimes at Boeing Computer Services for making the Harwell-Boeing matrices so readily available.  ...  BLAS library. [10] LAPACK can be used for solving many common linear algebra problems, including solution of linear systems, linear least-squares problems, eigenvalue problems, and singular value problems  ...  Both versions are written in standard Fortran and compiled using identical compiler options. Optimization of BLAS 1 BLAS 1 routines operate on vector and scalar data only.  ... 
dblp:journals/dtj/KamathHM94 fatcat:ibyqjwgy3vesrg2sl3nzxavblq

Performance of an Astrophysical Radiation Hydrodynamics Code under Scalable Vector Extension Optimization [article]

Dennis C. Smolarski, F. Douglas Swesty, Alan C. Calder
2022 arXiv   pre-print
We explored several compilers and performance analysis packages and found the code did not perform as expected under scalable vector extension optimization, suggesting that a "deeper dive" into analyzing  ...  The code solves sparse linear systems, a task for which the A64FX architecture should be well suited.  ...  ACKNOWLEDGMENT The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University for access to the innovative  ... 
arXiv:2207.13251v1 fatcat:vdx3oj7ipzg2nnzcjwil5verz4

Symmetric Indefinite Linear Solver Using OpenMP Task on Multicore Architectures

Ichitaro Yamazaki, Jakub Kurzak, Panruo Wu, Mawussi Zounon, Jack Dongarra
2018 IEEE Transactions on Parallel and Distributed Systems  
We also thank the Intel Corporation for their generous hardware donation and continuous financial support and the Oak Ridge Leadership Computing Facility for providing access to the ARMv8 and POWER8 systems  ...  ACKNOWLEDGMENTS The authors would like to thank the members of the PLASMA project for the valuable discussions.  ...  Such linear solvers are also needed for unconstrained or constrained optimization problems or for solving the augmented system for general least squares discretized-incompressible Navier-Stokes equations  ... 
doi:10.1109/tpds.2018.2808964 fatcat:yomrv5hnonfcjhzkmcneoqk6gq

Layout-oblivious compiler optimization for matrix computations

Huimin Cui, Qing Yi, Jingling Xue, Xiaobing Feng
2013 ACM Transactions on Architecture and Code Optimization (TACO)  
of compiler optimizations.  ...  to be much more accurately analyzed and optimized through varying state-of-the-art compiler technologies.  ...  Our approach solves this problem and enables alternative implementations of the same matrix computation to benefit from a common set of optimizations.  ... 
doi:10.1145/2400682.2400694 fatcat:24fhy46qvbgc7g3webltf4b5hi