Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor
[chapter]
2009
Lecture Notes in Computer Science
The paper provides (a) an exploration of the possibility of developing pragma directives for semi-automatic data-movement code generation in OpenMP; and (b) an introduction of the techniques used to implement ...
Currently, all OpenMP directives are used only to decompose computation code (loop iterations, tasks, code sections, etc.). ...
The diminishing-return point of the speedup is at 8 threads for SGEMV, while for SGEMM it is at 2 threads. For SASUM, the numbers of memory accesses and floating-point operations are equal. ...
doi:10.1007/978-3-642-03869-3_78
fatcat:6s6qil3finaobpnhziqls2mgjq
Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA
2013
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013
Loop tiling is used to enable block communications, suitable for DDR memories. ...
Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory. ...
Loop tiling and transformation function Loop tiling is a standard loop transformation, known to be effective for automatic parallelization and data locality improvement. ...
doi:10.7873/date.2013.127
dblp:conf/date/AliasDP13
fatcat:ur4zoex4pfhjtihvoeg6rfpojm
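The Alias et al. abstract above describes loop tiling used to enable block communications (suitable for DDR memories) and to reuse data already present in local memory. A minimal C sketch of that pattern, assuming an illustrative array size `N`, tile size `T`, and a `local` buffer standing in for the accelerator's fast local memory (none of these names come from the paper):

```c
#include <string.h>

#define N 256
#define T 64  /* tile size: one block transfer per tile */

/* Sum an array by streaming it tile by tile through a small local
 * buffer, mimicking block (burst) transfers from remote DDR memory:
 * the tile-loop issues one block transfer per tile, and the
 * point-loop then touches local memory only. */
long tiled_sum(const int remote[N]) {
    int local[T];                                    /* fast local memory */
    long sum = 0;
    for (int t = 0; t < N; t += T) {                 /* tile-loop */
        memcpy(local, &remote[t], T * sizeof(int));  /* one block transfer */
        for (int i = 0; i < T; i++)                  /* point-loop */
            sum += local[i];
    }
    return sum;
}
```

The reuse optimization the abstract mentions would go one step further: before issuing a transfer, check whether the tile's data is already resident in `local` from a previous tile and skip the copy if so.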
Optimizing remote accesses for offloaded kernels
2012
SIGPLAN notices
Loop tiling is used to enable block communications, suitable for DDR memories. ...
Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory. ...
Loop tiling and transformation function Loop tiling is a standard loop transformation, known to be effective for automatic parallelization and data locality improvement. ...
doi:10.1145/2370036.2145856
fatcat:ij2tevq3ebhc3mvvqdwg25rviq
Optimizing remote accesses for offloaded kernels
2012
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12
Loop tiling is used to enable block communications, suitable for DDR memories. ...
Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory. ...
Loop tiling and transformation function Loop tiling is a standard loop transformation, known to be effective for automatic parallelization and data locality improvement. ...
doi:10.1145/2145816.2145856
dblp:conf/ppopp/AliasDP12
fatcat:oxfjji66ajdudjfp2oblmyyxqe
An Approach for Semiautomatic Locality Optimizations Using OpenMP
[chapter]
2012
Lecture Notes in Computer Science
Our notion of tiled loops allows developers to easily describe data locality even in scenarios with non-trivial data dependencies. Furthermore, we describe two new optimization techniques. ...
As an additional contribution we explore the benefit of using multiple levels of tiling. ...
Making the notion of tiles available in OpenMP will not only enable developers to specify data locality and thereby increase performance on current CPUs, but lays out the foundation for future work to ...
doi:10.1007/978-3-642-28145-7_29
fatcat:pnlzpvfi65fajih63nchg7wqoi
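The abstract above argues for making tiles a first-class notion in OpenMP. The paper's own proposed directive syntax is not reproduced here; as an illustration of the idea, OpenMP 5.1 and later standardized a `tile` construct in the same spirit (the 32x32 tile size and array shapes below are illustrative assumptions):

```c
#define N 128

float a[N][N], b[N][N];

void tiled_copy(void) {
    /* OpenMP 5.1 'tile' construct: the compiler rewrites the nest
     * into tile-loops over 32x32 blocks with point-loops inside each
     * block, improving locality without changing the semantics.
     * Compilers without OpenMP support ignore the pragma. */
    #pragma omp tile sizes(32, 32)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = a[i][j];
}
```

Multiple levels of tiling, as explored in the paper, correspond to applying such a transformation repeatedly with decreasing tile sizes.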
Scaling Data-Intensive Applications on Heterogeneous Platforms with Accelerators
2012
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Tiling: stand-alone, or an additional level of tiling in existing polyhedral frameworks; mapping of tile access and communication code.
Run-time support: a tile streaming model (asynchronous execution and ...); buffered communication and streaming to GPU.
Tiling / multi-dimensional strip-mining: decompose outer loop nest(s) into two loops, a tile-loop and a point-loop; interchange; coarse-grain parallelism ...
doi:10.1109/ipdpsw.2012.230
dblp:conf/ipps/BalevicK12
fatcat:w46lyu4cf5gpfj6eeym5xhm57y
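The outline above describes multi-dimensional strip-mining: each loop is decomposed into a tile-loop and a point-loop, and the tile-loops are then interchanged outward so that a tile becomes the unit of coarse-grain parallelism and streaming. A minimal C rendering of the transformed nest, assuming illustrative sizes `N` and `T`:

```c
#define N 64
#define T 16

/* 2-D strip-mining plus interchange: both tile-loops (it, jt) are
 * hoisted outside both point-loops (i, j). Each (it, jt) tile is an
 * independent unit of coarse-grain work that could be buffered and
 * streamed to a GPU asynchronously. */
void transpose_tiled(float a[N][N], float b[N][N]) {
    for (int it = 0; it < N; it += T)          /* tile-loops */
        for (int jt = 0; jt < N; jt += T)
            for (int i = it; i < it + T; i++)  /* point-loops */
                for (int j = jt; j < jt + T; j++)
                    b[j][i] = a[i][j];
}
```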
Towards effective automatic parallelization for multicore systems
2008
Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)
Although virtually all production C compilers have automatic shared-memory parallelization capability, it is rarely used in practice by application developers because of limited effectiveness. ...
In this paper we describe our recent efforts towards developing an effective automatic parallelization system that uses a polyhedral model for data dependences and program transformations. ...
Acknowledgments We would like to acknowledge Cédric Bastoul and other contributors to the CLooG code generator and Martin Griebl and team for the LooPo infrastructure. ...
doi:10.1109/ipdps.2008.4536401
dblp:conf/ipps/BondhugulaBHKRRS08
fatcat:gv2yaercm5dp7awfy7buz2pvte
Increasing FPGA Accelerators Memory Bandwidth with a Burst-Friendly Memory Layout
[article]
2022
arXiv
pre-print
We assess how this technique pushes up the memory throughput, leaving room for exploiting additional parallelism, for a minimal logic overhead. ...
Techniques enabling data reuse, such as tiling, lower the pressure on memory traffic but still often leave the accelerators I/O-bound. ...
High-level synthesis tools can also perform loop unrolling, akin to the automatic loop vectorization for CPUs that has been part of state-of-the-art compilers for years [7], in order to ...
arXiv:2202.05933v2
fatcat:aohxwmpi7zh5ja26ki6xh7ekh4
Automatic data allocation and buffer management for multi-GPU machines
2013
ACM Transactions on Architecture and Code Optimization (TACO)
We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). ...
This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and ...
It provides a set of compute, data, and control flow directives for executing parallel for loops on accelerators. ...
doi:10.1145/2544100
fatcat:6jicdt2dcfebvlab4up62v72f4
Automatic C-to-CUDA Code Generation for Affine Programs
[chapter]
2010
Lecture Notes in Computer Science
efficient data access. ...
The performance of automatically generated code is compared with manually optimized CUDA code for a number of benchmarks. ...
transforms) to generate the correct ordering of inter-tile and intra-tile loops: for each statement s ∈ S do ...
doi:10.1007/978-3-642-11970-5_14
fatcat:euk4pngadbcrfdzheclqlslahu
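The C-to-CUDA abstract above refers to generating the correct ordering of inter-tile and intra-tile loops. A sequential C sketch of that ordering, with comments indicating the CUDA mapping such generated code would typically use (the SAXPY kernel and loop bounds are illustrative assumptions, not the paper's benchmark):

```c
#define N 64
#define T 8

/* Ordering after tiling: all inter-tile loops come first (these map
 * to the CUDA grid of thread blocks), then all intra-tile loops
 * (these map to the threads of a block). */
void saxpy_tiled(float y[N], const float x[N], float alpha) {
    for (int bt = 0; bt < N / T; bt++)   /* inter-tile: blockIdx.x  */
        for (int t = 0; t < T; t++) {    /* intra-tile: threadIdx.x */
            int i = bt * T + t;
            y[i] += alpha * x[i];
        }
}
```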
Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS
2018
IEEE Transactions on Parallel and Distributed Systems
The approach is generally applicable to any stencil DSL that provides per loop data access information. ...
In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications both in shared-memory and distributed-memory systems, relying on run-time ...
We acknowledge PRACE for awarding us access to resource Marconi based in Italy at Cineca. ...
doi:10.1109/tpds.2017.2778161
fatcat:cjoehtigwra6nm2qk4sznctq7i
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
[chapter]
2001
Lecture Notes in Computer Science
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry ...
This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures ...
Acknowledgments We would like to thank the Ohio Supercomputer Center (OSC) for the use of their computing facilities, and the National Science Foundation for partial support through grants DMR-9520319, ...
doi:10.1007/3-540-45307-5_21
fatcat:6gdgmt65gbb6dhxsgouuwxyyjm
Beyond 16GB: Out-of-Core Stencil Computations
2017
In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D ...
We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement. ...
Acknowledgments: The authors would like to thank IBM (József Surányi, Michal Iwanski) for access to a Minsky system, as well as Nikolay Sakharnykh at NVIDIA for the help with unified memory performance. This ...
doi:10.1145/3145617.3145619
dblp:conf/sc/RegulyMG17
fatcat:vi24kzcc5zdhxdrwvpi6bcc6r4
Beyond 16GB: Out-of-Core Stencil Computations
[article]
2017
arXiv
pre-print
In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D ...
We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement. ...
Acknowledgments: The authors would like to thank IBM (József Surányi, Michal Iwanski) for access to a Minsky system, as well as Nikolay Sakharnykh at NVIDIA for the help with unified memory performance. This ...
arXiv:1709.02125v2
fatcat:tsdxnsxznfai5ooi5we5ujfbcu
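The two entries above apply cache-blocking tiling to stencil codes. A minimal C sketch of a spatially blocked 5-point stencil sweep, assuming illustrative sizes `N` and `T` (the papers go further, tiling across multiple loops and time steps so out-of-core data stays resident in fast memory):

```c
#define N 32
#define T 8

/* Cache-blocked 5-point stencil sweep: the (j, i) iteration space is
 * traversed tile by tile so each T x T block of 'in' stays resident
 * in fast memory while it is processed (spatial blocking only). */
void jacobi_blocked(double in[N][N], double out[N][N]) {
    for (int jt = 1; jt < N - 1; jt += T)
        for (int it = 1; it < N - 1; it += T)
            for (int j = jt; j < jt + T && j < N - 1; j++)
                for (int i = it; i < it + T && i < N - 1; i++)
                    out[j][i] = 0.25 * (in[j-1][i] + in[j+1][i]
                                      + in[j][i-1] + in[j][i+1]);
}
```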
A compiler framework for optimization of affine loop nests for gpgpus
2008
Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. ...
In this paper, a number of issues are addressed towards the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) approach ...
with direct access of a from global memory (column "Direct Global"). ...
doi:10.1145/1375527.1375562
dblp:conf/ics/BaskaranBKRRS08
fatcat:x6rdnmlkvzaw7jfcet3pxzsewi