Effective resource management for enhancing performance of 2D and 3D stencils on GPUs

Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noël Pouchet, P. Sadayappan
2016 Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit - GPGPU '16  
While effective for 2D stencils, these techniques do not achieve the desired improvements for 3D stencils due to the hardware constraints of GPU.  ...  Applying these techniques to various 2D and 3D stencils gives a performance improvement of 200-400% over existing tools that target such computations.  ...  computations that overcomes these resource bottlenecks by effectively managing the shared memory and registers that are available on the GPU. • We evaluate the effect of using associative reordering of  ... 
doi:10.1145/2884045.2884047 dblp:conf/ppopp/RawatHRGPS16 fatcat:i2xqunmus5bsja6di542judxpa

A versatile software systolic execution model for GPU memory-bound kernels

Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka
2019 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '19  
For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.  ...  We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning  ...  for the 2D and 3D stencil benchmarks on Tesla P100 and V100 GPUs, the x-axis is the stencil benchmarks defined in Table 3.  ... 
doi:10.1145/3295500.3356162 dblp:conf/sc/ChenWTTM19 fatcat:xdivgrmgezfwrgzqp3rx6lmrm4

Accelerating High-Order Stencils on GPUs [article]

Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, Jie Meng
2020 arXiv   pre-print
While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of the proposed enhancements work well for high-order stencils, such as those used for seismic  ...  In this paper, we study high-order stencils and their unique characteristics on GPUs.  ...  We thank Keren Zhou from Rice University for reviewing the drafts of this paper and helping us use his emerging GPU Performance Advisor tool, which offered insights for tuning some of the kernels we studied  ... 
arXiv:2009.04619v2 fatcat:tq7kxemsejbyjdej32ka5nqrfi

Persistent Kernels for Iterative Memory-bound GPU Applications [article]

Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Satoshi Matsuoka
2022 arXiv   pre-print
We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geometric mean speedup of 2.29x in small domains and 1.53x in  ...  Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps.  ...  We show notable performance improvement for iterative 2D/3D stencils and a conjugate gradient solver for both V100 and A100 over highly optimized baselines.  ... 
arXiv:2204.02064v2 fatcat:campsz22iff5jfdmo7nrth7xje

Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Lingyuan Wang, Miaoqing Huang, Vikram K. Narayana, Tarek El-Ghazawi
2011 Proceedings of the 8th ACM International Conference on Computing Frontiers - CF '11  
Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains.  ...  However, the increased complexity as a result of architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU  ...  , and (b) MG is comprised of stencil computations and boundary exchange on a 3D mesh.  ... 
doi:10.1145/2016604.2016612 dblp:conf/cf/WangHNE11 fatcat:ebjggwl76ffhlbq7gawz4ksfe4

Hybrid Hexagonal/Classical Tiling for GPUs

Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, Sven Verdoolaege
2014 Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization - CGO '14  
We propose a time-tiling method for iterative stencil computations on GPUs. Our method does not involve redundant computations.  ...  Time-tiling is necessary for the efficient execution of iterative stencil computations.  ...  This work is partly funded by a Google European Fellowship in Efficient Computing, by the European FP7 project CARP id. 287767, by the COPCAMS ARTEMIS project, and award 0926688 from the U.S. NSF.  ... 
doi:10.1145/2581122.2544160 fatcat:mxabceid25cobd4dekka633kna

Multi-GPU Implementation of a 3D Finite Difference Time Domain Earthquake Code on Heterogeneous Supercomputers

Jun Zhou, Yifeng Cui, Efecan Poyraz, Dong Ju Choi, Clark C. Guest
2013 Procedia Computer Science  
We have developed a highly scalable 3D Finite Difference GPU code for use in earthquake engineering and disaster management through regional petascale earthquake simulations.  ...  This multi-GPU implementation has been validated and used for a large-scale verification wave propagation simulation of Mw5.4 Chino Hills earthquake using 128 GPUs.  ...  Fig. 4. Definitions of regions and symbols for the enhanced overlapping algorithm. Fig. 5. Effective computation and communication overlapping algorithm for AWP-ODC-GPU implementation. AWP-ODC-GPU  ... 
doi:10.1016/j.procs.2013.05.292 fatcat:ocdk7zb65bedjlv24s7t7j2oi4

Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2

Wei Xue, Chao Yang, Haohuan Fu, Xinliang Wang, Yangtong Xu, Junfeng Liao, Lin Gan, Yutong Lu, Rajiv Ranjan, Lizhe Wang
2015 IEEE transactions on computers  
A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance the in-socket performance.  ...  In this work an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2.  ...  To enhance the performance on many-core accelerators, we focus our work on tuning the stencil kernel of the 3D atmospheric model for Intel MIC architecture by model reformulation, data layout optimization  ... 
doi:10.1109/tc.2014.2366754 fatcat:hootyxzfaffftlkp4yxc5w6uue

AMC: Advanced Multi-accelerator Controller

Tassadaq Hussain, Amna Haider, Shakaib A. Gursal, Eduard Ayguadé
2015 Parallel Computing  
The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip.  ...  Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-accelerator's memory access patterns efficiently.  ...  For a single 3D-Stencil volume, number of accessed points is dependent on the number of planes and the stencil size.  ... 
doi:10.1016/j.parco.2014.10.003 fatcat:z7xne5erxjbihk54ns6kjwjpve

Entropic lattice Boltzmann simulation of three-dimensional binary gas mixture flow in packed beds using graphics processors

Mohammad Amin Safi, Mahmud Ashrafizaadeh
2016 International Journal of Computational Science and Engineering (IJCSE)  
Performance gains of one order of magnitude over optimized multi-core CPUs are achieved for the complex flow of interest on Fermi generation GPUs.  ...  Simulations are performed based on the latest proposed entropic lattice Boltzmann model for multi-component flows, using the D3Q27 lattice stencil.  ...  Although the problem size of the present simulations is limited by the available GPU memory resources, porting the problem to large GPU-based clusters alleviates such restrictions.  ... 
doi:10.1504/ijcse.2016.076937 fatcat:n5wesxde3nhqbp772fry4awl2a

Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax–Wendroff correction stencil

Yang You, Haohuan Fu, Shuaiwen Leon Song, Maryam Mehri Dehnavi, Lin Gan, Xiaomeng Huang, Guangwen Yang
2014 The international journal of high performance computing applications  
We also conduct cross-platform performance and power analysis (focusing on Kepler GPU and MIC) and the results could serve as insights for users selecting the most suitable accelerators for their targeted  ...  take advantage of our evaluated architectures, we manage to achieve performance efficiencies ranging from 4.730% to 20.02% of the theoretical peak.  ...  Funding This work was supported in part by the National Natural Science Foundation of China (grant numbers 61303003 and 41374113) and the National High-tech R&D (863) Program of China (grant number 2013AA01A208  ... 
doi:10.1177/1094342014524807 fatcat:2mbqnyrhtngatl6wxmcuxdfiya

GPU technology applied to reverse time migration and seismic modeling via OpenACC

Ahmad Qawasmeh, Barbara Chapman, Maxime Hugues, Henri Calandra
2015 Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '15  
A performance enhancement of ∼10x was obtained when the acoustic model was ported to a single GPU, compared with a 1.3x speedup obtained using the isotropic model.  ...  Although we implement a hybrid OpenACC-MPI approach to parallelize seismic modeling and RTM on multiple GPUs, in this paper, we focus on developing mapping techniques to exploit potentials of one GPU.  ...  We would also like to thank Sunita Chandrasekaran from the HPCTools group at University of Houston for her feedback on the paper. Many thanks go to TOTAL for providing the computing resources.  ... 
doi:10.1145/2712386.2712401 dblp:conf/ppopp/QawasmehCHC15 fatcat:bvb257wfc5a7ni5k3hqtggmokq

Efficient Implementation of Liquid Crystal Simulation Software on Modern HPC Platforms

Ilya V. Afanasyev, Dmitry I. Lichmanov, Vladimir Yu. Rudyak, Vadim V. Voevodin
2021 Supercomputing Frontiers and Innovations  
On this basis, we evaluate and compare the efficiency of the developed computational kernels on different platforms and subsequently rank these platforms by their performance.  ...  We evaluate the effects of various optimizations, such as using more suitable memory access patterns, multitasking for efficient utilization of massive parallelism on the target architectures, improved  ...  Acknowledgments The reported study was funded by the Russian Foundation for Basic Research, project number 20-37-70036. The work presented in section 5.  ... 
doi:10.14529/jsfi210306 dblp:journals/superfri/AfanasyevLRV21 fatcat:hnb7igitebc2hc77xv7enhpnnm

Forma: a DSL for image processing applications to target GPUs and multi-core CPUs

Mahesh Ravishankar, Justin Holewinski, Vinod Grover
2015 Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015  
The high-level description allows the compiler to generate efficient code through use of compile-time analysis and by taking advantage of hardware resources, like texture memory on GPUs.  ...  Our experimental results show that using Forma allows developers to obtain comparable performance on both CPU and GPU with less programmer effort.  ...  Sadayappan from Ohio State University for his comments. Finally, we thank the reviewers of this paper for their helpful comments regarding related work and possible enhancements.  ... 
doi:10.1145/2716282.2716290 dblp:conf/ppopp/RavishankarHG15 fatcat:cwggdu43ufczdjtiz4bluw7bba

IMPLEMENTATION OF THE DIFFERENCE SCHEME FOR ABSORPTION EQUATION TYPE PROBLEMS APPLYING PARALLEL COMPUTING TECHNOLOGIES

Maksims Zigunovs
2021 Environment Technology Resources Proceedings of the International Scientific and Practical Conference  
This paper describes a way of using parallel algorithm technology to analyze parabolic differential problems of physical processes on the surface.  ...  The use of parallel computing technologies makes it possible to accelerate the mentioned calculations in different ways and to different effect, depending on the parallel technology type and method combinations used during  ...  Fig. 6. ADI. Each partstep equation contains 3 unknowns in only one direction for a 5 point stencil (examples: 5 point stencil for 2D space and 7 point stencil for 3D space) and all other direction unknowns  ... 
doi:10.17770/etr2021vol2.6633 fatcat:olcmhicmxzbuti4wdbtzxaqdxi