Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor [chapter]

Ge Gan, Xu Wang, Joseph Manzano, Guang R. Gao
2009 Lecture Notes in Computer Science  
The paper provides (a) an exploration of the possibility of developing pragma directives for semi-automatic data movement code generation in OpenMP; (b) an introduction of techniques used to implement  ...  Currently, all OpenMP directives are only used to decompose computation code (such as loop iterations, tasks, code sections, etc.).  ...  The speedup diminishing return point of SGEMV is 8-thread, while for SGEMM, it is 2-thread. For SASUM, its memory accesses and floating-point operations are the same.  ... 
doi:10.1007/978-3-642-03869-3_78 fatcat:6s6qil3finaobpnhziqls2mgjq

Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA

Christophe Alias, Alain Darte, Alexandru Plesco
2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013  
Loop tiling is used to enable block communications, suitable for DDR memories.  ...  Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory.  ...  Loop tiling and transformation function Loop tiling is a standard loop transformation, known to be effective for automatic parallelization and data locality improvement.  ... 
doi:10.7873/date.2013.127 dblp:conf/date/AliasDP13 fatcat:ur4zoex4pfhjtihvoeg6rfpojm

Optimizing remote accesses for offloaded kernels

Christophe Alias, Alain Darte, Alexandru Plesco
2012 SIGPLAN notices  
Loop tiling is used to enable block communications, suitable for DDR memories.  ...  Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory.  ...  Loop tiling and transformation function Loop tiling is a standard loop transformation, known to be effective for automatic parallelization and data locality improvement.  ... 
doi:10.1145/2370036.2145856 fatcat:ij2tevq3ebhc3mvvqdwg25rviq

Optimizing remote accesses for offloaded kernels

Christophe Alias, Alain Darte, Alexandru Plesco
2012 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12  
Loop tiling is used to enable block communications, suitable for DDR memories.  ...  Finally, data reuse among tiles is exploited to avoid remote accesses when data are already available in the local memory.  ...  Loop tiling and transformation function Loop tiling is a standard loop transformation, known to be effective for automatic parallelization and data locality improvement.  ... 
doi:10.1145/2145816.2145856 dblp:conf/ppopp/AliasDP12 fatcat:oxfjji66ajdudjfp2oblmyyxqe

An Approach for Semiautomatic Locality Optimizations Using OpenMP [chapter]

Jens Breitbart
2012 Lecture Notes in Computer Science  
Our notion of tiled loops allows developers to easily describe data locality even at scenarios with non-trivial data dependencies. Furthermore, we describe two new optimization techniques.  ...  As an additional contribution we explore the benefit of using multiple levels of tiling.  ...  Making the notion of tiles available in OpenMP will not only enable developers to specify data locality and thereby increase performance on current CPUs, but lays out the foundation for future work to  ... 
doi:10.1007/978-3-642-28145-7_29 fatcat:pnlzpvfi65fajih63nchg7wqoi

Scaling Data-Intensive Applications on Heterogeneous Platforms with Accelerators

Ana Balevic, Bart Kienhuis
2012 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum  
Tiling: stand-alone or additional level of tiling in existing polyhedral frameworks • Mapping of tile access and communication code • Run-time support: tile streaming model, asynchronous execution and  ...  Buffered communication and streaming to GPU • Tiling / multi-dimensional strip-mining: decompose outer loop nest(s) into two loops (tile-loop, point-loop) • Interchange • Coarse-grain parallelism  ... 
doi:10.1109/ipdpsw.2012.230 dblp:conf/ipps/BalevicK12 fatcat:w46lyu4cf5gpfj6eeym5xhm57y

Towards effective automatic parallelization for multicore systems

Uday Bondhugula, Muthu Baskaran, Albert Hartono, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, P. Sadayappan
2008 Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)  
Although virtually all production C compilers have automatic shared-memory parallelization capability, it is rarely used in practice by application developers because of limited effectiveness.  ...  In this paper we describe our recent efforts towards developing an effective automatic parallelization system that uses a polyhedral model for data dependences and program transformations.  ...  Acknowledgments We would like to acknowledge Cédric Bastoul and other contributors to the CLooG code generator and Martin Griebl and team for the LooPo infrastructure.  ... 
doi:10.1109/ipdps.2008.4536401 dblp:conf/ipps/BondhugulaBHKRRS08 fatcat:gv2yaercm5dp7awfy7buz2pvte

Increasing FPGA Accelerators Memory Bandwidth with a Burst-Friendly Memory Layout [article]

Corentin Ferry and Tomofumi Yuki and Steven Derrien and Sanjay Rajopadhye
2022 arXiv   pre-print
We assess how this technique pushes up the memory throughput, leaving room for exploiting additional parallelism, for a minimal logic overhead.  ...  Techniques enabling data reuse, such as tiling, lower the pressure on memory traffic but still often leave the accelerators I/O-bound.  ...  High-level synthesis tools also have the ability to perform loop unrolling, akin to automatic loop vectorization for CPUs which has been part of state-of-the-art compilers for years [7], in order to  ... 
arXiv:2202.05933v2 fatcat:aohxwmpi7zh5ja26ki6xh7ekh4

Automatic data allocation and buffer management for multi-GPU machines

Thejas Ramashekar, Uday Bondhugula
2013 ACM Transactions on Architecture and Code Optimization (TACO)  
We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM).  ...  This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and  ...  It provides a set of compute, data, and control flow directives for executing parallel for loops on accelerators.  ... 
doi:10.1145/2544100 fatcat:6jicdt2dcfebvlab4up62v72f4

Automatic C-to-CUDA Code Generation for Affine Programs [chapter]

Muthu Manikandan Baskaran, J. Ramanujam, P. Sadayappan
2010 Lecture Notes in Computer Science  
efficient data access.  ...  The performance of automatically generated code is compared with manually optimized CUDA code for a number of benchmarks.  ...  transforms) to generate the correct ordering of inter-tile and intra-tile loops. for each statement s ∈ S do 3.  ... 
doi:10.1007/978-3-642-11970-5_14 fatcat:euk4pngadbcrfdzheclqlslahu

Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS

Istvan Z. Reguly, Gihan R. Mudalige, Michael B. Giles
2018 IEEE Transactions on Parallel and Distributed Systems  
The approach is generally applicable to any stencil DSL that provides per loop data access information.  ...  In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications both in shared-memory and distributed-memory systems, relying on run-time  ...  We acknowledge PRACE for awarding us access to resource Marconi based in Italy at Cineca.  ... 
doi:10.1109/tpds.2017.2778161 fatcat:cjoehtigwra6nm2qk4sznctq7i

Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization [chapter]

D. Cociorva, J. Wilkins, G. Baumgartner, P. Sadayappan, J. Ramanujam, M. Nooijen, D. Bernholdt, R. Harrison
2001 Lecture Notes in Computer Science  
The goal of our project is the development of a program synthesis system to facilitate the development of high-performance parallel programs for a class of computations encountered in computational chemistry  ...  This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures  ...  Acknowledgments We would like to thank the Ohio Supercomputer Center (OSC) for the use of their computing facilities, and the National Science Foundation for partial support through grants DMR-9520319,  ... 
doi:10.1007/3-540-45307-5_21 fatcat:6gdgmt65gbb6dhxsgouuwxyyjm

Beyond 16GB

István Z. Reguly, Gihan R. Mudalige, Michael B. Giles
2017 Proceedings of the Workshop on Memory Centric Programming for HPC - MCHPC'17  
In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D  ...  We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement.  ...  ACKNOWLEDGMENTS The authors would like to thank IBM (József Surányi, Michal Iwanski) for access to a Minsky system, as well as Nikolay Sakharnykh at NVIDIA for the help with unified memory performance. This  ... 
doi:10.1145/3145617.3145619 dblp:conf/sc/RegulyMG17 fatcat:vi24kzcc5zdhxdrwvpi6bcc6r4

Beyond 16GB: Out-of-Core Stencil Computations [article]

Istvan Z Reguly, Gihan R Mudalige, Michael B Giles
2017 arXiv   pre-print
In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D  ...  We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement.  ...  ACKNOWLEDGMENTS The authors would like to thank IBM (József Surányi, Michal Iwanski) for access to a Minsky system, as well as Nikolay Sakharnykh at NVIDIA for the help with unified memory performance. This  ... 
arXiv:1709.02125v2 fatcat:tsdxnsxznfai5ooi5we5ujfbcu

A compiler framework for optimization of affine loop nests for gpgpus

Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, P. Sadayappan
2008 Proceedings of the 22nd annual international conference on Supercomputing - ICS '08  
factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling.  ...  In this paper, a number of issues are addressed towards the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) approach  ...  with direct access of a from global memory (column "Direct Global").  ... 
doi:10.1145/1375527.1375562 dblp:conf/ics/BaskaranBKRRS08 fatcat:x6rdnmlkvzaw7jfcet3pxzsewi