Register optimizations for stencils on GPUs

Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, P. Sadayappan
2018 SIGPLAN notices  
While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal  ...  A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse.  ...  Acknowledgments We thank the anonymous reviewers for their feedback and suggestions that helped improve the paper.  ... 
doi:10.1145/3200691.3178500 fatcat:hfof4j52hfehdlatemtmgugzhy

Register optimizations for stencils on GPUs

Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, P. Sadayappan
2018 Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '18  
While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal  ...  A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse.  ...  Acknowledgments We thank the anonymous reviewers for their feedback and suggestions that helped improve the paper.  ... 
doi:10.1145/3178487.3178500 dblp:conf/ppopp/RawatRSPRS18 fatcat:fv2wqp3ktrakhg4gp4yvd77fyi

Improving Performance and Energy Efficiency of Geophysics Applications on GPU Architectures [chapter]

Pablo J. Pavan, Matheus S. Serpa, Emmanuell Diaz Carreño, Víctor Martínez, Edson Luiz Padoin, Philippe O. A. Navaux, Jairo Panetta, Jean-François Mehaut
2019 Communications in Computer and Information Science  
The optimizations we developed applied to Graphics Processing Units (GPU) algorithms for stencil applications achieve a performance improvement of up to 44.65% compared with the read-only version.  ...  In this context, this paper proposes optimization methods to accelerate performance and increase energy efficiency of geophysics applications used in conjunction to algorithm and GPU memory characteristics  ...  For this reason, one of the most important strategies for optimizing the performance of stencil computing is the optimization of memory access.  ... 
doi:10.1007/978-3-030-16205-4_9 fatcat:rfcguh7u4rgfpb7nzl24zbhv5y

AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, Satoshi Matsuoka
2020 Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization  
We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.  ...  Stencil computation is one of the most widely-used compute patterns in high performance computing applications.  ...  In [27], Rawat et al. present another DSL-based stencil framework called ARTEMIS which supports flexible resource allocation on GPUs (global memory or shared memory + register) for each input/output  ... 
doi:10.1145/3368826.3377904 dblp:conf/cgo/MatsumuraZWEM20 fatcat:x2y7cxwhw5c4tma5tjy44e6oey

Effective resource management for enhancing performance of 2D and 3D stencils on GPUs

Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noël Pouchet, P. Sadayappan
2016 Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit - GPGPU '16  
A major challenge in optimizing stencil computations is to effectively utilize all resources available on the GPU.  ...  While effective for 2D stencils, these techniques do not achieve the desired improvements for 3D stencils due to the hardware constraints of GPU.  ...  Optimization for associative stencils. For stencils that access more than one point per plane from other planes along z axis, the streaming + registers version will incur high register pressure.  ... 
doi:10.1145/2884045.2884047 dblp:conf/ppopp/RawatHRGPS16 fatcat:i2xqunmus5bsja6di542judxpa

Automatic communication optimizations through memory reuse strategies

Muthu Manikandan Baskaran, Nicolas Vasilache, Benoit Meister, Richard Lethin
2012 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12  
We apply our techniques and obtain performance improvement on various stencil kernels including an important iterative stencil kernel in seismic processing applications where the performance is comparable  ...  This general concept is well-suited to the hardware properties of GPGPUs, which is the architecture that we concentrate on for this paper.  ...  Details of Communication Optimizations Our emphasis in this work is on the memory reuse optimizations in GPUs that particularly focus on managing on-chip memories such as the shared memory and registers  ... 
doi:10.1145/2145816.2145852 dblp:conf/ppopp/BaskaranVML12 fatcat:tmy6f3uoife3rlsvfm4lipliy4

Optimized three-dimensional stencil computation on Fermi and Kepler GPUs

Anamaria Vizitiu, Lucian Itu, Cosmin Nita, Constantin Suciu
2014 2014 IEEE High Performance Extreme Computing Conference (HPEC)  
Overall, the GTX680 GPU card performs best for a kernel with 2D thread block structure and optimized register and shared memory usage.  ...  In this paper we focus on double precision stencil computations, which are required for meeting the high accuracy requirements, inherent for scientific computations.  ...  Different optimization techniques have been reported more recently for GPU based stencil computations.  ... 
doi:10.1109/hpec.2014.7040968 dblp:conf/hpec/VizitiuINS14 fatcat:6qcen4yfdrb6dny2lwyt6hck4e

Optimizing and Auto-Tuning Iterative Stencil Loops for GPUs with the In-Plane Method

Wai Teng Tang, Wen Jun Tan, Ratna Krishnamoorthy, Yi Wen Wong, Shyh-Hao Kuo, Rick Siow Mong Goh, Stephen John Turner, Weng-Fai Wong
2013 2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
In this work, we proposed a novel in-plane method for stencil computations on GPUs and compared its performance with the conventional method implemented in the Nvidia SDK.  ...  We also implemented an auto-tuning framework for our method to select the optimal parameters for different GPU architectures.  ...  ACKNOWLEDGMENT This work was supported by the Agency for Science, Technology and Research PSF Grant No. 102-101-0028. We are also grateful to the anonymous reviewers for their comments.  ... 
doi:10.1109/ipdps.2013.79 dblp:conf/ipps/TangTKWKGTW13 fatcat:iewu2ceayrhqpd6x7mw3sdtoye

A versatile software systolic execution model for GPU memory-bound kernels

Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka
2019 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '19  
For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.  ...  This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs.  ...  Rawat et al. proposed a reorder framework to optimize register allocation for both CPUs and GPUs [47, 48].  ... 
doi:10.1145/3295500.3356162 dblp:conf/sc/ChenWTTM19 fatcat:xdivgrmgezfwrgzqp3rx6lmrm4

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Yongpeng Zhang, Frank Mueller
2012 Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12  
This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs.  ...  and generates the code with optimal parameter configurations for different GPUs.  ...  Overall, there is no universal, optimal configuration for all types of stencil computations on different GPU models.  ... 
doi:10.1145/2259016.2259037 dblp:conf/cgo/ZhangM12 fatcat:fly3dlcqnrdbnbneeab4ylcazy

Persistent Kernels for Iterative Memory-bound GPU Applications [article]

Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Satoshi Matsuoka
2022 arXiv   pre-print
Typical GPU implementations have a loop on the host side that invokes the GPU kernel as much as time/algorithm steps there are.  ...  We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS).  ...  We show notable performance improvement for iterative 2D/3D stencils and a conjugate gradient solver for both V100 and A100 over highly optimized baselines.  ... 
arXiv:2204.02064v2 fatcat:campsz22iff5jfdmo7nrth7xje

Autogeneration and Autotuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters

Yongpeng Zhang, Frank Mueller
2013 IEEE Transactions on Parallel and Distributed Systems  
This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs.  ...  and generates the code with optimal parameter configurations for different GPUs.  ...  Overall, there is no universal, optimal configuration for all types of stencil computations on different GPU models.  ... 
doi:10.1109/tpds.2012.160 fatcat:ps3snatb6vdehct2hlfrviav4e

Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax–Wendroff correction stencil

Yang You, Haohuan Fu, Shuaiwen Leon Song, Maryam Mehri Dehnavi, Lin Gan, Xiaomeng Huang, Guangwen Yang
2014 The International Journal of High Performance Computing Applications  
For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels.  ...  We also conduct cross-platform performance and power analysis (focusing on Kepler GPU and MIC) and the results could serve as insights for users selecting the most suitable accelerators for their targeted  ...  Acknowledgements We would like to thank Zihong Lv for his advice in paper writing.  ... 
doi:10.1177/1094342014524807 fatcat:2mbqnyrhtngatl6wxmcuxdfiya

Efficient 3D stencil computations using CUDA

Marcin Krotkiewski, Marcin Dabrowski
2013 Parallel Computing  
We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs.  ...  Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented.  ...  Section 7 presents our algorithms and performance results in the context of previous work on optimizing stencil computations for the GPUs.  ... 
doi:10.1016/j.parco.2013.08.002 fatcat:pz2y4ntllvcs7dc7jxldmikjjq

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Yang Yang, Hui-Min Cui, Xiao-Bing Feng, Jing-Ling Xue
2012 Journal of Computer Science and Technology  
In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory.  ...  Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups up to 2.93X over methods that use circular queues implemented with shared-memory  ...  Our work is based on circular queue, and extends it to registers. Stencil on GPU.  ... 
doi:10.1007/s11390-012-1206-3 fatcat:r45ekbmayjayrfrmka77nqqe44