A general algorithm for tiling the register level

M. Jiménez, J. M. Llabería, A. Fernández, E. Morancho
1998 Proceedings of the 12th international conference on Supercomputing - ICS '98  
In this paper we present a new general algorithm to perform tiling for the register level in more than one dimension in both rectangular and nonrectangular iteration spaces.  ...  Tiling is a well-known loop transformation that can be used to exploit data reuse at the register level and to improve a program's ILP.  ...  In this paper we have presented a new general method that performs tiling for the register level.  ... 
doi:10.1145/277830.277859 dblp:conf/ics/JimenezLFM98 fatcat:y6zj5ktajnbjfh6izyoilgxgyy

Compact multi-dimensional kernel extraction for register tiling

Lakshminarayanan Renganarayana, Uday Bondhugula, Salem Derisavi, Alexandre E. Eichenberger, Kevin O'Brien
2009 Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09  
We show that by using COMDEX as a pre-processing to register tiling we can (i) enable register tiling on complex loop structures and (ii) realize a significant performance improvement on a variety of codes  ...  Downstream optimizations such as register tiling (unroll-and-jam plus scalar promotion) typically provide a significant performance improvement.  ...  The input to the kernel extraction algorithm is a multi-level tiled (possibly imperfect) loop nest.  ... 
doi:10.1145/1654059.1654105 dblp:conf/sc/RenganarayanaBDEO09 fatcat:wcvfqonpr5h6rcyvuixudh26uu
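
The snippet above describes register tiling as unroll-and-jam plus scalar promotion. A minimal sketch of that idea on a matrix product follows; it is an illustration under stated assumptions, not the COMDEX algorithm or any code from the paper. The function names, the 2x2 tile size, and the choice of matrix multiplication are all hypothetical, and Python locals only stand in for the machine registers a compiler would use.

```python
# Illustrative sketch only: names, the 2x2 tile, and the matmul kernel are
# assumptions made for this example, not taken from the listed paper.

def matmul_naive(A, B, n, m, p):
    """Reference triple loop: C (n x p) = A (n x m) * B (m x p)."""
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_register_tiled(A, B, n, m, p):
    """Same product with the i and j loops unrolled by 2 and jammed into k."""
    C = [[0.0] * p for _ in range(n)]
    n_full = n - n % 2  # rows covered by full 2x2 tiles
    p_full = p - p % 2  # columns covered by full 2x2 tiles

    # Full tiles: four accumulators are promoted to scalars for the k loop.
    for i in range(0, n_full, 2):
        for j in range(0, p_full, 2):
            c00 = c01 = c10 = c11 = 0.0
            for k in range(m):
                a0, a1 = A[i][k], A[i + 1][k]   # each loaded once, used twice
                b0, b1 = B[k][j], B[k][j + 1]   # each loaded once, used twice
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            C[i][j], C[i][j + 1] = c00, c01
            C[i + 1][j], C[i + 1][j + 1] = c10, c11

    # Partial tiles: leftover row and/or column handled by the plain loop.
    for i in range(n):
        for j in range(p):
            if i < n_full and j < p_full:
                continue
            acc = 0.0
            for k in range(m):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C
```

Inside a full tile, each iteration of `k` performs four loads for four multiply-adds, where the naive loop needs two loads per multiply-add; that reuse is what the scalar accumulators are meant to expose. The separate cleanup loop shows why several entries in this list care about distinguishing full tiles from partial ones.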

A cost-effective implementation of multilevel tiling

M. Jimenez, J.M. Llaberia, A. Fernandez
2003 IEEE Transactions on Parallel and Distributed Systems  
Although computation of exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level.  ...  This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest having affine functions as bounds (nonrectangular loop nest).  ...  ACKNOWLEDGMENTS This work was supported by the Ministry of Education and Science of Spain (CICYT TIC2001-0995-C02-01).  ... 
doi:10.1109/tpds.2003.1239869 fatcat:g6tghe5zubfofdfchgcfrpui6i

A unified transformation technique for multilevel blocking [chapter]

M. Jiménez, J. M. Llabería, A. Fernández, E. Morancho
1996 Lecture Notes in Computer Science  
This paper presents a new unified method for simultaneously tiling the register and cache levels of the memory hierarchy. We will only focus on the code transformation phase of tiling.  ...  Our algorithm uses strip-mining and loop interchange on all memory hierarchy levels to determine the tiles as usual, and, afterwards, and due to the special characteristics of the register level, we apply  ...  Acknowledgments This work was supported by the Ministry of Education and Science of Spain (CICYT TIC-0429/95).  ... 
doi:10.1007/3-540-61626-8_53 fatcat:w5sdcaaxova75ogkankkxus6hq
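
The snippet above mentions strip-mining and loop interchange as the usual way to form tiles. A minimal sketch of those two steps on a rectangular two-dimensional loop nest is given below. It is a generic textbook construction, not the unified method of the paper; the function name and the parameters `ti` and `tj` are hypothetical.

```python
# Illustrative sketch only: the function name and tile-size parameters are
# assumptions made for this example, not taken from the listed paper.

def tiled_iteration_order(n, m, ti, tj):
    """Return the (i, j) visit order of an n x m loop nest tiled by ti x tj."""
    order = []
    # Strip-mining splits each loop into a tile loop (ii, jj) and an element
    # loop (i, j); interchange then moves the jj tile loop outside the i loop.
    for ii in range(0, n, ti):
        for jj in range(0, m, tj):
            # min() clamps the element loops so partial tiles at the edges
            # stay inside the original iteration space.
            for i in range(ii, min(ii + ti, n)):
                for j in range(jj, min(jj + tj, m)):
                    order.append((i, j))
    return order
```

With `n = m = 4` and `ti = tj = 2`, the first four iterations visited are `(0, 0)`, `(0, 1)`, `(1, 0)`, `(1, 1)`, that is, one complete tile before the next begins. Every point of the original iteration space is still visited exactly once.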

Model-Guided Empirical Optimization for Multimedia Extension Architectures: A Case Study

Chun Chen, Jaewook Shin, Shiva Kintali, Jacqueline Chame, Mary Hall
2007 2007 IEEE International Parallel and Distributed Processing Symposium  
In this paper, we describe a compiler that combines optimization across all levels of the memory hierarchy with automatic generation of SIMD code for multimedia extensions.  ...  Compiler technology for multimedia extensions must effectively utilize not only the SIMD compute engines but also the various levels of the memory hierarchy: superword registers, multi-level caches and  ...  For each level of the memory hierarchy, from registers to the last cache level (and also considering TLB), the algorithm identifies a set of array references and a loop carrying temporal reuse for the  ... 
doi:10.1109/ipdps.2007.370641 dblp:conf/ipps/ChenSKCH07 fatcat:awdgu3nk45dwdcvxoejdavsfey

Locality Optimization for Data Parallel Programs [article]

Eric Hielscher, Alex Rubinsteyn, Dennis Shasha
2013 arXiv   pre-print
Applying this transformation once tiles the program for cache, and applying it again enables tiling for registers.  ...  We introduce a novel tiling transformation to generate tiled operators automatically.  ...  We leave this for future work. TILE SIZES We run our transformation twice, once to generate a level of tiles for the L1 cache, and once to generate tiles for registers.  ... 
arXiv:1304.1835v1 fatcat:npsyicogqfhozjsptavcc3wzta

A Systematic Approach to Model-Guided Empirical Search for Memory Hierarchy Optimization [chapter]

Chun Chen, Jacqueline Chame, Mary Hall, Kristina Lerman
2006 Lecture Notes in Computer Science  
The goal of this work is a systematic approach to compiler optimization for simultaneously optimizing across multiple levels of the memory hierarchy.  ...  In previous work, we propose a compiler algorithm for deriving a set of parameterized solutions, followed by a model-guided empirical search to determine the best integer parameter values and select the  ...  Conclusion This paper shows how the problem of optimizing for multiple levels of the memory hierarchy can be recast as a multi-variable optimization problem.  ... 
doi:10.1007/978-3-540-69330-7_30 fatcat:6akstbukbvbf3n6jgbrcp52tqy

Multi-level tiling

DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, Michelle Mills Strout
2007 Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07  
We present an algorithm that can generate multi-level parameterized tiled loops at the same cost as generating single-level tiled loops.  ...  The efficiency of our method is demonstrated on several benchmarks. We also present a method, useful in register tiling, for separating partial and full tiles at any arbitrary level of tiling.  ...  For example, for a 2-level tiling in the context of caches and registers an inner level of tiling might be preferred.  ... 
doi:10.1145/1362622.1362691 dblp:conf/sc/KimRRRS07 fatcat:htimkznv6jfubbxmqguypdxn7u

Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs [article]

Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn
2018 arXiv   pre-print
We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms.  ...  We also develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for GEMM and Strassen.  ...  We discuss and analyze the performance of our algorithms through modeling in Section 5.  ... 
arXiv:1808.07984v1 fatcat:nc5htjv6xnhuzn2dxke4366k44

Applications of storage mapping optimization to register promotion

Patrick Carribault, Albert Cohen
2004 Proceedings of the 18th annual international conference on Supercomputing - ICS '04  
Register tiling is a complementary approach to exhibit scalar reuse at different depths in a loop nest; it plays the first role in optimization strategies for many memory-bound kernels from numerical,  ...  Our work is motivated by the empirical study of a computational biology benchmark, the approximate string matching algorithm BPR from NR-grep, on a wide issue micro-architecture.  ...  We extend a folding technique to better handle tiled iteration spaces and exploit the topmost level of the memory hierarchy.  ... 
doi:10.1145/1006209.1006244 dblp:conf/ics/CarribaultC04 fatcat:strvtxkj4bghnjb7ncw6la3bny

Accelerating kernel density estimation on the GPU using the CUDA framework

P. D. Michailidis, K. G. Margaritis
2013 Applied Mathematical Sciences  
In this work we discuss a naive and two optimised CUDA algorithms for the two kernel estimation methods: univariate and multivariate.  ...  We also present exploratory experimental results of the proposed CUDA algorithms according to the several values of parameters such as number of threads per block, tile size, loop unroll level, number  ...  Further, a second general conclusion, we can tell that optimal performance of CUDA algorithms 5 and 6 is achieved when there is a linear relation between the thread block size and tile size for any number  ... 
doi:10.12988/ams.2013.13133 fatcat:x5cc7z5ckbfmdoqhpbqh2nnyby

High-performance implementations of the Descartes method

Jeremy R. Johnson, Werner Krandick, Kevin Lynch, David G. Richardson, Anatole D. Ruslanov
2006 Proceedings of the 2006 international symposium on Symbolic and algebraic computation - ISSAC '06  
We present an implementation of the Bernstein-bases variant of the Descartes method that automatically generates architecture-aware high-level code and leaves further optimizations to the compiler.  ...  The first variant uses Taylor shift by 1 as its main subalgorithm, the second uses de Casteljau's algorithm.  ...  The register tiling techniques carry over to de Casteljau's algorithm. We implement the Descartes method using de Casteljau's algorithm with register tiling.  ... 
doi:10.1145/1145768.1145797 dblp:conf/issac/JohnsonKLRR06 fatcat:cho3mdcirraibi5bstuotb6pzu

Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores [article]

Junkyeong Choi, Hyucksung Kwon, Woongkyu Lee, Jungwook Choi, Jieun Lim
2022 arXiv   pre-print
The search space also includes options of register-level packing and layout optimization to lessen overhead of handling reduced-precision data.  ...  Finally, we propose a search algorithm to find the best schedule by learning from the distinctive candidates.  ...  That is, NHWC data layout should be reshaped into NHWCnc layout where 'n' stands for the input feature map batch size as a row size of WMMA register tile and 'c' stands for the input feature map channel  ... 
arXiv:2202.06819v2 fatcat:nax323vvvjhype3d5fjhhqfdqm

A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

Vasilios Kelefouras
2017 Computing  
The transformations are the following: loop tiling (including the number of the levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts.  ...  /binary for each sub-problem and these schedules cannot coexist, as by refining one degrades the other.  ...  The initial search space is shown in Fig. 1 ; for a two level cache architecture it includes one level of tiling (tiling for the L1 or L2 cache), 2 levels of tiling (tiling for both L1 and L2 cache) and  ... 
doi:10.1007/s00607-016-0535-4 fatcat:kgw2ys3qfzbknbsr564pvdulsu

Parametric multi-level tiling of imperfectly nested loops

Albert Hartono, Muthu Manikandan Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, P. Sadayappan
2009 Proceedings of the 23rd International Conference on Supercomputing - ICS '09  
Tiling is a crucial loop transformation for generating high performance code on modern architectures.  ...  The tiling technique generates loops that iterate over full rectangular tiles, making them amenable to compiler optimizations such as register tiling.  ...  We thank Lakshminarayanan Renganarayanan for valuable feedback that helped improve the presentation of the paper. This work was supported in part by the U.S.  ... 
doi:10.1145/1542275.1542301 dblp:conf/ics/HartonoBBCKNRS09 fatcat:utxnnulxobdvng6jpdvw4rvrkq