Cache-efficient matrix transposition

S. Chatterjee, S. Sen
Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)  
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large.  ...  Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms  ...  model very efficiently.  ... 
doi:10.1109/hpca.2000.824350 dblp:conf/hpca/ChatterjeeS00 fatcat:7rl5ck2arzbmtbqx2erfvaj6ne

Cache Oblivious Matrix Transpositions using Sequential Processing

Korde P.S.
2013 IOSR Journal of Engineering  
The optimal cache-oblivious matrix transposition makes O(1 + N^2/B) cache misses. In this paper we implement a divide-and-conquer based algorithm for matrix transposition through a recursive process.  ...  Matrix transposition is a fundamental operation in linear algebra and in fast Fourier transforms, with applications in numerical analysis, image processing, and graphics.  ...  Sequential Processing for Matrix Transposition Algorithm, which gives better performance in terms of cache efficiency. Table 1: 3×3 cache miss ratio.  ... 
doi:10.9790/3021-031145055 fatcat:vukegafjkzhrhpnthiezmywlti

Cache Oblivious Matrix Transposition: Simulation and Experiment [chapter]

Dimitrios Tsifakis, Alistair P. Rendell, Peter E. Strazdins
2004 Lecture Notes in Computer Science  
A cache oblivious matrix transposition algorithm is implemented and analyzed using simulation and hardware performance counters.  ...  Contrary to its name, the cache oblivious matrix transposition algorithm is found to exhibit a complex cache behavior with a cache miss ratio that is strongly dependent on the associativity of the cache  ...  An exception to this is a paper by Chatterjee and Sen (C&S) [5] on "Cache-Efficient Matrix Transposition".  ... 
doi:10.1007/978-3-540-24687-9_3 fatcat:egvfpqgrnjedhgvysjv4ylpo6u

Combining analytical and empirical approaches in tuning matrix transposition

Qingda Lu, Sriram Krishnamoorthy, P. Sadayappan
2006 Proceedings of the 15th international conference on Parallel architectures and compilation techniques - PACT '06  
Matrix transposition is an important kernel used in many applications.  ...  INPUT PARAMETERS: Our objective is to generate an efficient implementation of the matrix transposition operation.  ...  There have been studies on how to achieve space-efficiency in matrix transposition or its more generalized forms [1, 14, 2, 7]. Our present work does not handle in-place transposition.  ... 
doi:10.1145/1152154.1152190 dblp:conf/IEEEpact/LuKS06 fatcat:dtfss24gpfe7znuzk2hdpexcvy

Ideal and Predictable Hit Ratio for Matrix Transposition in Data Caches

Alba Pedro-Zapater, Clemente Rodríguez, Juan Segarra, Rubén Gran Gran Tejero, Víctor Viñals-Yúfera
2020 Mathematics  
Matrix transposition is a fundamental operation, but it may present a very low and hardly predictable data cache hit ratio for large matrices.  ...  We also analyze the energy consumption and execution time of matrix transposition on real hardware with pseudo-LRU (PLRU) caches.  ...  Cache-Efficient Matrix Transposition: Reference [3] describes several matrix transposition algorithms and compares their performance using both simulation and real execution on a Sun UltraSPARC II based  ... 
doi:10.3390/math8020184 fatcat:7ig76wmj65hvbnzcbyra33wpwy

A decomposition for in-place matrix transposition

Bryan Catanzaro, Alexander Keller, Michael Garland
2014 SIGPLAN notices  
We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors.  ...  Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than  ...  Cache-aware Rotate: We can improve the performance of column rotations on the array by ensuring all cache-lines read and written to and from memory are utilized efficiently.  ... 
doi:10.1145/2692916.2555253 fatcat:zdlat5rknrb7fi7v26s3ovjepy

A decomposition for in-place matrix transposition

Bryan Catanzaro, Alexander Keller, Michael Garland
2014 Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14  
We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors.  ...  Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than  ...  Cache-aware Rotate: We can improve the performance of column rotations on the array by ensuring all cache-lines read and written to and from memory are utilized efficiently.  ... 
doi:10.1145/2555243.2555253 dblp:conf/ppopp/CatanzaroKG14 fatcat:v3gi2u6u35f5razpa4fd4rrmgy

Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture [article]

Christian Knauth, Boran Adas, Daniel Whitfield, Xuesong Wang, Lydia Ickler, Tim Conrad, Oliver Serang
2017 arXiv   pre-print
approach, which reduces the bit-reversed permutation to smaller bit-reversed permutations and a square matrix transposition.  ...  matrix buffer.  ...  When paired with an optimal cache-oblivious method for matrix transposition [13], it guarantees fairly contiguous memory accesses without any knowledge of the cache architecture.  ... 
arXiv:1708.01873v1 fatcat:vx3zpajytrcf7o3hyyk6weozum

Efficient Processing of Large Data Structures on GPUs: Enumeration Scheme Based Optimisation

Marcin Gorawski, Michal Lorek
2017 International journal of parallel programming  
In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, out-of-place algorithm for  ...  The purpose of this paper is to highlight the performance issues of the matrix transposition algorithms for large matrices, relating to the Translation Lookaside Buffer (TLB) cache.  ...  They demonstrated that by using a Morton layout their cache efficient transposition algorithm offered better performance than other canonical layouts.  ... 
doi:10.1007/s10766-017-0515-0 fatcat:7cqfwvx3ubhefhdsohxgwb42cq

On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators

Ardavan Pedram, Andreas Gerstlauer, Robert A. van de Geijn
2012 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing  
, such as GEMM, SYRK and matrix transposition.  ...  Linear algebra computations can be efficiently reduced down to a canonical set of Basic Linear Algebra Subroutines (BLAS), such as matrix-matrix and matrix-vector operations [7] .  ...  Since the cores do not transpose matrices efficiently, the overall SYRK performance is limited by the matrix transposition behavior.  ... 
doi:10.1109/sbac-pad.2012.35 dblp:conf/sbac-pad/PedramGG12 fatcat:uprnlnt7ffarxc4j6zwv7omdru

Algorithms for In-Place Matrix Transposition [chapter]

Fred G. Gustavson, David W. Walker
2014 Lecture Notes in Computer Science  
CCDSC 2014 Pros and Cons • Cycle-based matrix transposition is elegant and simple to implement. • Has irregular memory access patterns so does not use cache efficiently.  ...  Do this by: -1D transform wrt each dimension in turn -Transpose wrt 2 of the dimensions after each 1D transform -Makes efficient use of cache -Facilitates parallelization Column-Major Ordering • In CMO  ... 
doi:10.1007/978-3-642-55195-6_10 fatcat:ihuxiqi33rdzpk3ppzwllf6a2u

Modeling set associative caches behavior for irregular computations

Basilio B. Fraguela, Ramón Doallo, Emilio L. Zapata
1998 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems - SIGMETRICS '98/PERFORMANCE '98  
Two different irregular kernels are considered: the sparse matrix-vector product and the transposition of a sparse matrix.  ...  While much work has been devoted to the study of cache behavior during the execution of codes with regular access patterns, little attention has been paid to irregular codes.  ...  the transposition of a sparse-matrix.  ... 
doi:10.1145/277851.277910 dblp:conf/sigmetrics/FraguelaDZ98 fatcat:qyhohgwadndctgw62bhbx7v7sy

An Efficient Dual-Channel Data Storage and Access Method for Spaceborne Synthetic Aperture Radar Real-Time Processing

Guoqing Wang, He Chen, Yizhuang Xie
2021 Electronics  
However, the characteristics of external memory have led to matrix transposition becoming a technical bottleneck that limits the real-time performance of the SAR imaging system.  ...  The experimental results show that the reading efficiency of the data controller proposed is 80% both in the range direction and azimuth direction, and the writing efficiency is 66% both in the range direction  ...  80% 69% 74% 93.75% - 83% Azimuth access efficiency 80% 80% 74% 74% - 83% Matrix transposition time 0.43 s 0.45 s 0.83 s 1.18 s 0.33 s - Cache RAM number 4 4 4 8 - - Pipeline  ... 
doi:10.3390/electronics10060662 fatcat:d3cicnehkjdqjgalw3hvenfeay

Modeling set associative caches behavior for irregular computations

Basilio B. Fraguela, Ramón Doallo, Emilio L. Zapata
1998 Performance Evaluation Review  
Two different irregular kernels are considered: the sparse matrix-vector product and the transposition of a sparse matrix.  ...  While much work has been devoted to the study of cache behavior during the execution of codes with regular access patterns, little attention has been paid to irregular codes.  ...  the transposition of a sparse-matrix.  ... 
doi:10.1145/277858.277910 fatcat:tkecl4izb5b7xkrb4jureijrta

Cache Oblivious Algorithms [chapter]

Piyush Kumar
2003 Lecture Notes in Computer Science  
Acknowledgements The author would like to thank Michael Bender, Matteo Frigo, Joe Mitchell, Edgar Ramos and Peter Sanders for discussions on cache obliviousness and to MPI Informatik, Saarbrücken, Germany  ...  Here is the C/C++ code for cache oblivious matrix transposition.  ...  In section 4 we choose matrix transposition as an example to learn the practical issues in cache oblivious algorithm design.  ... 
doi:10.1007/3-540-36574-5_9 fatcat:5h62jw67wvgllc3ll66glbgria