A Comparison of Compiler Tiling Algorithms [chapter]

Gabriel Rivera, Chau-Wen Tseng
1999 Lecture Notes in Computer Science  
Comparing the efficacy of different tiling algorithms, we discover rectangular tiles are slightly more efficient than square tiles. Overall, tiling improves performance from 0-250%.  ...  Results show padding improves performance of matrix multiply by over 100% in some cases over a range of matrix sizes.  ...  In general, this can occur only at the first invocation, and a simple comparison with C ol s will prevent consideration of such tile sizes.  ... 
doi:10.1007/978-3-540-49051-7_12 fatcat:cne6thlra5bk5damoozthrx7pu

A script-based autotuning compiler system to generate high-performance CUDA code

Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, Jacqueline Chame
2013 ACM Transactions on Architecture and Code Optimization (TACO)  
This system achieves performance comparable to and sometimes better than manually tuned libraries and exceeds the performance of a state-of-the-art GPU compiler.  ...  This article presents a novel compiler framework for CUDA code generation.  ...  Performance Comparison with State-of-the-Art GPU Compiler We applied our system to a set of optimized benchmarks generated by a state-of-the-art compiler presented in Baskaran et al. [2010].  ... 
doi:10.1145/2400682.2400690 fatcat:aqluktapgbhufnsy5l4qjp4bra

High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Arslan Munir, Farinaz Koushanfar, Ann Gordon-Ross, Sanjay Ranka
2013 Journal of Supercomputing  
Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.  ...  We discuss the performance optimizations on a single tile (processor core) as well as parallel performance optimizations, such as application decomposition, cache locality, tile locality, memory balancing  ...  Furthermore, the views expressed are those of the author(s) and do not reflect the official policy or position of the Department of Defense or the US Government. We would like to acknowledge Dr.  ... 
doi:10.1007/s11227-013-0916-9 fatcat:wi7tizdsdvhuhd2fwxxai65iem

Parallel cache-efficient code for computing the McCaskill partition functions

Marek Pałkowski, Włodzimierz Bielecki
2019 Proceedings of the 2019 Federated Conference on Computer Science and Information Systems  
A TRACO tiling strategy uses the transitive closure of a dependence graph to avoid affine function calculation. The ISL scheduler is used to parallelize tiled loop nests.  ...  To optimize code, we use the authorial source-to-source TRACO compiler and compare obtained code performance to that generated with the state-of-the-art PluTo compiler based on the affine transformations  ...  EXPERIMENTAL STUDY This section presents the results of the comparison of the performance of TRACO and PLuTo tiled codes implementing McCaskill's algorithm.  ... 
doi:10.15439/2019f8 dblp:conf/fedcsis/PalkowskiB19 fatcat:ng45r6gvrrcdpgzbre6vqa46ti

High-performance implementations of the Descartes method

Jeremy R. Johnson, Werner Krandick, Kevin Lynch, David G. Richardson, Anatole D. Ruslanov
2006 Proceedings of the 2006 international symposium on Symbolic and algebraic computation - ISSAC '06  
We present an implementation of the Bernstein-bases variant of the Descartes method that automatically generates architecture-aware high-level code and leaves further optimizations to the compiler.  ...  We compare the performance of our implementation, algorithmically tuned implementations of the monomial and Bernstein variants, and architecture-unaware implementations of both variants on four different  ...  Tiled Bernstein. The tiled Bernstein method is compiled with the same compilers and flags as the SACLIB methods.  ... 
doi:10.1145/1145768.1145797 dblp:conf/issac/JohnsonKLRR06 fatcat:cho3mdcirraibi5bstuotb6pzu

Virtualization of heterogeneous machines hardware description in a synthesizable object-oriented language

Joshua Auerbach, David F. Bacon, Perry Cheng, Rodric Rabbah, Sunil Shukla
2011 Proceedings of the 48th Design Automation Conference on - DAC '11  
This paper illustrates the salient synthesis-oriented features of the language using a photomosaic algorithm with inherent bit, pipeline, and data parallelism.  ...  Lime is a new Java-compatible and object-oriented language designed to make programming of reconfigurable hardware significantly more accessible to skilled software developers.  ...  We also implemented a native C and Verilog version of the scoring algorithm for future performance comparisons.  ... 
doi:10.1145/2024724.2024923 dblp:conf/dac/AuerbachBCRS11 fatcat:4rfeclmcmraxfbw4bxezrgsdnq

On the Interaction of Tiling and Automatic Parallelization [chapter]

Zhelong Pan, Brian Armstrong, Hansang Bae, Rudolf Eigenmann
2008 Lecture Notes in Computer Science  
In an effort to include a tiling pass into an advanced parallelizing compiler, we have found that the interaction of tiling and parallelization raises unexplored issues.  ...  Iteration space tiling is a well-explored programming and compiler technique to enhance program locality.  ...  Section 4 presents the algorithm for tiling in concert with parallelism and discusses related issues arising in a parallelizing compiler.  ... 
doi:10.1007/978-3-540-68555-5_3 fatcat:u23cizotebfb5ihuqcmqesof44

Introducing 'Bones'

Cedric Nugteren, Henk Corporaal
2012 Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units - GPGPU-5  
The compiler generates target code based on skeletons of parallel structures, which can be seen as parameterisable library implementations for a set of algorithm classes.  ...  This classification is used in a new source-to-source compiler, which is based on the algorithmic skeletons technique.  ...  He is a graduate student at Eindhoven University of Technology.  ... 
doi:10.1145/2159430.2159431 dblp:conf/asplos/NugterenC12 fatcat:p2zsjdbannfwngjqetl2j4k7ne

Flextended Tiles

Jie Zhao, Albert Cohen
2019 ACM Transactions on Architecture and Code Optimization (TACO)  
Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers.  ...  Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom  ...  ACKNOWLEDGMENTS This work was partly supported by the National Natural Science Foundation of China under Grant No. 61702546, and the European Commission through the MNEMOSENE project id. 780215.  ... 
doi:10.1145/3369382 fatcat:rofcrtjlbre2zpoilctdvbklge

A methodology for speeding up loop kernels by exploiting the software information and the memory architecture

Vasilios Kelefouras, Angeliki Kritikakou, Costas Goutis
2015 Computer languages, systems & structures  
Second, they take into account only part of the specific algorithms information. Third, they take into account only a few hardware architecture parameters.  ...  It is well-known that today's compilers and state of the art libraries have three major drawbacks.  ...  This research has been co-financed by the European Union (European Social Fund ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic  ... 
doi:10.1016/ fatcat:giverm6gvvbtpcnjap77xjo3de

Accelerating Minimum Cost Polygon Triangulation Code with the TRACO Compiler

Marek Pałkowski, Włodzimierz Bielecki
2018 Communication Papers of the 2018 Federated Conference on Computer Science and Information Systems  
First, the code is tiled by means of the transitive closure of a dependence graph. TRACO allows for tiling of the innermost loop nest that is not possible by means of other closely related compilers.  ...  MCPT is a recursive algorithm encountering each subproblem many times in different branches of its recursion tree.  ...  EXPERIMENTAL STUDY This section presents the results of the comparison of TRACO and PLuTo tiled code performance.  ... 
doi:10.15439/2018f8 dblp:conf/fedcsis/PalkowskiB18 fatcat:rjaa7gidubhvribnhvgir6nl3m

A multi-objective auto-tuning framework for parallel codes

Herbert Jordan, Peter Thoman, Juan J. Durillo, Simone Pellegrini, Philipp Gschwandtner, Thomas Fahringer, Hans Moritsch
2012 2012 International Conference for High Performance Computing, Networking, Storage and Analysis  
Focusing on individual code regions, our compiler uses a novel search technique to compute a set of optimal solutions, which are encoded into a multi-versioned executable.  ...  In this paper we introduce a multi-objective autotuning framework comprising compiler and runtime components.  ...  In this paper, we propose a novel multi-objective optimization algorithm to be used within our iterative compiler framework. We refer to this algorithm as RS-GDE3.  ... 
doi:10.1109/sc.2012.7 dblp:conf/sc/JordanTBPGFM12 fatcat:7gq43aun75dcpnpvady63cmmfy

GPU-accelerated Chemical Similarity Assessment for Large Scale Databases

Marco Maggioni, Marco Domenico Santambrogio, Jie Liang
2011 Procedia Computer Science  
In our work, we present a general GPU algorithm for all-to-all chemical comparisons considering both binary fingerprints and floating point descriptors as molecule representation.  ...  We test the proposed algorithm on different experimental setups, a laptop with a low-end GPU and a desktop with a more performant GPU.  ...  Sliding-tile technique The GPU algorithm can be further improved by considering sliding tiles. Until now we have considered a biunivocal relation between GPU threads and similarity comparisons.  ... 
doi:10.1016/j.procs.2011.04.219 pmid:27774113 pmcid:PMC5072535 fatcat:qmpuxuokbfgodcfqssefx2gnxm

Performance portable GPU code generation for matrix multiplication

Toomas Remmelg, Thibaut Lutz, Michel Steuwer, Christophe Dubach
2016 Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit - GPGPU '16  
We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization like tiling and register blocking in a composable way.  ...  Using an exploration strategy our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized, but provably correct, implementation of matrix multiplication.  ...  The tile function is used twice to tile both matrices in the last two lines. Line 11 combines a row of tiles of matrix A with a column of tiles of matrix B using the zip primitive.  ... 
doi:10.1145/2884045.2884046 dblp:conf/ppopp/RemmelgLSD16 fatcat:udqfkh2fb5gclgifnakv6jpujq

Optimizing graph algorithms for improved cache performance

J.-S. Park, M. Penner, V.K. Prasanna
2004 IEEE Transactions on Parallel and Distributed Systems  
Tiling has long been used to improve cache performance. Recursion has recently been used as a cache-oblivious method of improving cache performance.  ...  For these algorithms, we demonstrate up to a 2x improvement by using a cache friendly graph representation.  ...  Figures 11 & 12 show a comparison of the best Floyd-Warshall algorithm with Dijkstra's algorithm for sparse graphs.  ... 
doi:10.1109/tpds.2004.44 fatcat:knegwttiirgzbon755jkezo3ni