
Enabling efficient stencil code generation in OpenACC

Alyson D. Pereira, Rodrigo C.O. Rocha, Márcio Castro, Luís F.W. Góes, Mario A.R. Dantas
2017 Procedia Computer Science  
In this paper, we propose stencil extensions to enable efficient code generation in OpenACC.  ...  Our results show that our stencil extensions may improve the performance of OpenACC by up to 28% and 45% on GPU and CPU, respectively.  ... 
doi:10.1016/j.procs.2017.05.155 fatcat:lrp2ydzi7ne37oh2nmlqaigr64

A Data-Centric Directive-Based Framework to Accelerate Out-of-Core Stencil Computation on a GPU

Jingcheng SHEN, Fumihiko INO, Albert FARRÉS, Mauricio HANZICH
2020 IEICE transactions on information and systems  
Great programming effort is needed to manually implement efficient out-of-core stencil code.  ...  The generated code also demonstrated usefulness with small datasets that fit in the device capacity, running 1.3 times as fast as an in-core implementation.  ...  Acknowledgments This study was supported in part by the Japan Society for the Promotion of Science KAKENHI Grant Numbers JP15H01687, JP16H02801, and JP20K21794.  ... 
doi:10.1587/transinf.2020pap0014 fatcat:z6dwu7a2g5fj3kahvb4elv5s4q

Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption [article]

Suejb Memeti and Lu Li and Sabri Pllana and Joanna Kolodziej and Christoph Kessler
2017 arXiv pre-print
To evaluate the programming productivity we use our homegrown tool CodeStat, which enables us to determine the percentage of code lines that was required to parallelize the code using a specific framework  ...  In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy.  ...  Acknowledgment This article is based upon work from COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet), supported by COST (European Cooperation in Science  ... 
arXiv:1704.05316v1 fatcat:lax3kghaxnanxixklx3haavlxa

Automatic generation of parallel C code for stencil applications written in MATLAB

Johannes Spazier, Steffen Christgau, Bettina Schnor
2016 Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming - ARRAY 2016  
For the Game-of-Life application, the generated parallel code shows nearly optimal speedup.  ...  This paper presents performance results of an automatic translation from a MATLAB subset into efficient parallelized C code for different architectures: multicores, compute clusters, and GPGPUs.  ...  In contrast, StencilPaC is able to generate efficient code for the lookup.  ... 
doi:10.1145/2935323.2935329 dblp:conf/pldi/SpazierCS16 fatcat:jxkflwc4czhh7c6mpzairp6v3e

Compiling a High-Level Directive-Based Programming Model for GPGPUs [chapter]

Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, Barbara Chapman
2014 Lecture Notes in Computer Science  
In this paper, we present the research and development challenges, and our solutions to create an open-source OpenACC compiler in a mainstream compiler framework (OpenUH, a branch of Open64).  ...  OpenACC is an emerging directive-based programming model for programming accelerators that typically enables non-expert programmers to achieve portable and productive performance of their applications.  ...  Acknowledgements This work was supported in part by the NVIDIA and Department of Energy under Award Agreement No. DE-FC02-12ER26099.  ... 
doi:10.1007/978-3-319-09967-5_6 fatcat:64q7lozynvg7nnb2n2stqkn5ae

Directive-based GPU programming for computational fluid dynamics

Brent P. Pickering, Charles W. Jackson, Thomas R.W. Scogland, Wu-Chun Feng, Christopher J. Roy
2015 Computers & Fluids  
improvements attainable for our CFD algorithm on common GPU platforms, as well as to determine the modifications that must be made to the original source code in order to run efficiently on the GPU.  ...  In this work we analyze the popular OpenACC programming standard, as implemented by the PGI compiler suite, in order to evaluate its utility and performance potential in computational fluid dynamics (CFD  ...  Acknowledgments This work was supported by an Air Force Office of Scientific Research (AFOSR) Basic Research Initiative in the Computational Mathematics program with Dr.  ... 
doi:10.1016/j.compfluid.2015.03.008 fatcat:guzjdb7llnbsbosoczysbe3r7y

Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms [article]

Weicheng Xue, Christopher J. Roy
2020 arXiv pre-print
These two optimizations are general enough to be beneficial to stencil computations having ghost changes on all of the clusters tested.  ...  Finally, overlapping the communication and computations is shown to be not efficient on multi-GPUs if only using MPI or MPI+OpenACC.  ...  McCall and Behzad Baghapour for creating the original BDC code as well as giving advice, and thank Charles W. Jackson for reviewing the paper and participating in various helpful discussions.  ... 
arXiv:2006.02602v1 fatcat:vkldeh3tqfhyfovny27go7zx5y

Relative debugging for a highly parallel hybrid computer system

Luiz DeRose, Andrew Gontarek, Aaron Vose, Robert Moench, David Abramson, Minh Ngoc Dinh, Chao Jin
2015 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15  
In this paper we extend relative debugging to support porting stencil computation on a hybrid computer.  ...  We describe a generic data model that allows programmers to examine the global state across different types of applications, including MPI/OpenMP, MPI/OpenACC, and UPC programs.  ...  Hybrid computers equipped with GPGPUs have been shown to execute stencil code efficiently [3] [46].  ... 
doi:10.1145/2807591.2807605 dblp:conf/sc/RoseGVMADJ15 fatcat:pjq7qzkylzewzlphqoermywo64

Towards a performance portable, architecture agnostic implementation strategy for weather and climate models

2014 Supercomputing Frontiers and Innovations  
The dynamical core is built on top of the domain specific "Stencil Loop Language" for stencil computations on structured grids, a generic framework for halo exchange and boundary conditions, as well as  ...  All these tools are implemented in C++ making extensive use of generic programming and template metaprogramming.  ...  Fortran code refactoring (OpenACC) The parts of the code that have been left in their original Fortran form were ported to GPUs with OpenACC compiler directives [27].  ... 
doi:10.14529/jsfi140103 fatcat:b2fj5fsrmnfwfpdjawmvm3ssdu

An MPI+OpenACC-based PRM scalar advection scheme in the GRAPES model over a cluster with multiple CPUs and GPUs

Huadong Xiao, Yang Lu, Jianqiang Huang, Wei Xue
2022 Tsinghua Science and Technology  
/OpenCL and Open Accelerator (OpenACC).  ...  Computation of the scalar advection involves boundary exchange, and computation of higher bandwidth requirements is complicated and time-consuming in GRAPES.  ...  Two allocatable arrays were used to pack and unpack the halo data for arrays with dimensions of 2, 3, and 4 in the original version, enabling the efficient transfer of contiguous ranges between different  ... 
doi:10.26599/tst.2020.9010026 fatcat:yzbiiw3l6zbdlolxytsvprqg3m

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance [chapter]

Guido Juckeland, William Brantley, Sunita Chandrasekaran, Barbara Chapman, Shuai Che, Mathew Colgrove, Huiyu Feng, Alexander Grund, Robert Henschel, Wen-Mei W. Hwu, Huian Li, Matthias S. Müller (+12 others)
2015 Lecture Notes in Computer Science  
The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform.  ...  Hybrid nodes with hardware accelerators are becoming very common in systems today.  ...  The authors thank Cloyce Spradling for his work on the SPEC harness as well as the SPEC POWER group for their work on enabling the integration of power measurements into other SPEC suites.  ... 
doi:10.1007/978-3-319-17248-4_3 fatcat:wcdquz4gqffsrihtu3olf5nuty

Performance Analysis of a High-Level Abstractions-Based Hydrocode on Future Computing Systems [chapter]

G. R. Mudalige, I. Z. Reguly, M. B. Giles, A. C. Mallinson, W. P. Gaudin, J. A. Herdman
2015 Lecture Notes in Computer Science  
Specifically, we present (1) the lessons learnt in re-engineering an industrial representative hydro-dynamics application to utilize the OPS high-level framework and subsequent code generation to obtain  ...  the appropriate parallel library enabling execution on different back-end hardware platforms.  ...  The main aim of the project is to generate cache efficient multi-threaded CPU code for structured mesh (i.e. stencil) computations.  ... 
doi:10.1007/978-3-319-17248-4_5 fatcat:djnfmpwt75f2biaupo3nkdztzu

JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization [article]

Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña
2021 arXiv pre-print
This paper introduces JACC, an OpenACC runtime framework which enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler.  ...  Optimizations for obtaining the best possible efficiency, however, are often challenging.  ...  From OpenACC code in C or Fortran, our implementation generates JACC code, in which kernel code is embedded as strings.  ... 
arXiv:2110.14340v1 fatcat:acfa6g7xm5dyfajen7fqkn4yri

Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs [article]

Wim Vanderbauwhede, Syed Waqar Nabi
2019 arXiv pre-print
There is a large body of legacy scientific code written in languages like Fortran that is not optimised to get the best performance out of heterogeneous acceleration devices like GPUs and FPGAs, and manually  ...  Our results show better FPGA performance against a baseline CPU implementation, and better energy-efficiency against both CPU and GPU implementations.  ...  extending the current compiler to enable automatic generation of optimized FPGA OpenCL code.  ... 
arXiv:1901.00416v2 fatcat:zw2wwku7j5c5xkjlvvmbb4kucq

Structure Slicing: Extending Logical Regions with Fields

Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
, with speedups of up to 3.68X over a vectorized CPU-only Fortran implementation and 1.88X over an independently hand-tuned OpenACC code.  ...  In this paper, we present structure slicing, which incorporates fields into the logical region data model.  ...  Entangling layout with data movement optimizations in the OpenACC code results in code that is difficult to modify when exploring different mapping strategies and tuning for new architectures.  ... 
doi:10.1109/sc.2014.74 dblp:conf/sc/BauerTSA14 fatcat:vqinvcz2cnd6pfjkxdalg2qcwe