Characterizing and enhancing global memory data coalescing on GPUs

Naznin Fauzia, Louis-Noel Pouchet, P. Sadayappan
2015 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory.  ...  There is a need for tools that can provide feedback to users about statements in a GPU kernel where non-coalesced data access occurs, and assistance in fixing the problem.  ...  National Science Foundation through awards 0926127, 1321147 and 1440749.  ... 
doi:10.1109/cgo.2015.7054183 dblp:conf/cgo/FauziaPS15 fatcat:cu4ipgwwbrdjtetkr4vjr5oo6m

CUDA-Lite: Reducing GPU Programming Complexity [chapter]

Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, Wen-mei W. Hwu
2008 Lecture Notes in Computer Science  
We present CUDA-lite, an enhancement to CUDA, as one such tool.  ...  Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer.  ...  Acknowledgment We would like to thank David Kirk and NVIDIA for generous hardware loans and support. We also thank the anonymous reviewers for their feedback.  ... 
doi:10.1007/978-3-540-89740-8_1 fatcat:wkn4kvk4h5ephpxjw3tcxl4ham

Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features [article]

Aditya Chilukuri, Josh Milthorpe, Beau Johnston
2020 arXiv   pre-print
The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures.  ...  The Architecture-Independent Workload Characterization (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program  ...  On most NVIDIA GPUs, data accesses are coalesced when multiple requests are made for memory locations from the same cache line in global memory [15].  ... 
arXiv:2003.06064v1 fatcat:24y6blwtofb6njvhq3ny6dccvu

A compiler framework for optimization of affine loop nests for gpgpus

Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, P. Sadayappan
2008 Proceedings of the 22nd annual international conference on Supercomputing - ICS '08  
to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding  ...  factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling.  ...  National Science Foundation through awards 0121676, 0121706, 0403342, 0508245, 0509442, 0509467 and 0541409.  ... 
doi:10.1145/1375527.1375562 dblp:conf/ics/BaskaranBKRRS08 fatcat:x6rdnmlkvzaw7jfcet3pxzsewi

Performance characterization of mobile GP-GPUs

Fitsum Assamnew Andargie, Jonathan Rose
2015 AFRICON 2015  
In this paper we unearth key microarchitectural parameters of the Qualcomm Adreno 320 and 420 GP GPUs, present in one of the key SoCs in the industry, the Snapdragon series of chips.  ...  As smartphones and tablets have become more sophisticated, they now include General Purpose Graphics Processing Units (GP GPUs) that can be used for computation beyond driving the high-resolution screens  ...  Global Memory Throughput Measurement Access to the global external memory in a GPU (shown as main memory in Figure 1) is much slower compared to the constant, local and private memories that are on-chip  ... 
doi:10.1109/afrcon.2015.7332026 dblp:conf/africon/AndargieR15 fatcat:swqaj33injcsheq4cqlmurp4im

Investigating Warp Size Impact in GPUs [article]

Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
2012 arXiv   pre-print
We analyze warp size impact on memory coalescing and branch divergence.  ...  Our evaluations show that building coalescing-enhanced small warp GPUs is a better approach compared to pursuing a control-flow enhanced large warp GPU.  ...  We use our analysis to investigate the effectiveness of two possible approaches to enhance GPUs. The first approach relies on enhancing memory coalescing in GPUs using large warps.  ... 
arXiv:1205.4967v1 fatcat:7qb3b556c5cafoyedr56piotzy

GDPI: Signature based Deep Packet Inspection using GPUs

Nausheen Shoaib, Jawwad Shamsi, Tahir Mustafa, Akhter Zaman, Jazib ul, Mishal Gohar
2017 International Journal of Advanced Computer Science and Applications  
The framework is developed using enhanced GPU programming techniques, such as asynchronous packet processing using streams, minimizing CPU to GPU latency using pinned memory and zero copy, and memory coalescing  ...  with shared memory which reduces read operation from global memory of the GPU.  ...  GPU global memory.  ... 
doi:10.14569/ijacsa.2017.081128 fatcat:hrzim6u7efaedjvinb3luumfqe

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli
2011 IEEE Transactions on Parallel and Distributed Systems  
In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies  ...  One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures.  ...  The authors would also like to thank the anonymous contributors on the AMD stream computing and CUDA forums for their valuable discussions and clarification of some subjects.  ... 
doi:10.1109/tpds.2010.107 fatcat:lync4a5tlvf37g3w5kuomqzxje

Warp size impact in GPUs

Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
2013 Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units - GPGPU-6  
We analyze warp size impact on memory coalescing and branch divergence.  ...  Our evaluations show that building coalescing-enhanced small warp GPUs is a better approach compared to pursuing a control-flow enhanced large warp GPU.  ...  This work was supported by School of Computer Science at Institute for Research in Fundamental Sciences (IPM) and the Natural Sciences and Engineering Research Council of Canada, Discovery Grants Program  ... 
doi:10.1145/2458523.2458538 dblp:conf/asplos/LashgarBK13 fatcat:pz23sadgtbauxgmhic26cw4upu

Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms

Yash Ukidave, Amir Kavyan Ziabari, Perhaad Mistry, Gunar Schirner, David Kaeli
2014 The international journal of high performance computing applications  
Our study covers discrete GPUs, shared memory GPUs (APUs) and low power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy  ...  More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform with vast differences based on the target hardware and associated application design  ...  Funding This work was supported by Analog Devices Inc, AMD, Nvidia and Qualcomm, and by a National Science Foundation (NSF) ERC Innovation Award (grant number EEC-0946463) and an NSF CNS Award.  ... 
doi:10.1177/1094342014526907 fatcat:tj6z7n6esbhhlmpfnkpuedohj4

Low-cost, high-speed computer vision using NVIDIA's CUDA architecture

Seung In Park, Sean P. Ponce, Jing Huang, Yong Cao, Francis Quek
2008 2008 37th IEEE Applied Imagery Pattern Recognition Workshop  
GPUs are SIMD (Single Instruction, Multiple Data) devices that are inherently data-parallel.  ...  Specifically, we demonstrate the efficiency of our approach by a parallelization and optimization of Canny's edge detection algorithm, and applying it to a computation and data-intensive video motion tracking  ...  -0551610, and "Embodiment Awareness, Mathematics Discourse and the Blind," NSF-IIS-0451843.  ... 
doi:10.1109/aipr.2008.4906458 dblp:conf/aipr/ParkPHCQ08 fatcat:rxfoqkw63be7bhovwqaegyccem

Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs

Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou
2012 2012 41st International Conference on Parallel Processing  
., code segments leading to inefficient use of GPU hardware. We characterize these performance bugs, and propose the bug fixes.  ...  Our experiments confirm both significant performance gains and energy savings from our fixes and reveal interesting insights on different GPUs.  ...  This work is supported by an NSF CAREER award CCF-0968667 and a gift fund from AMD Inc.  ... 
doi:10.1109/icpp.2012.30 dblp:conf/icpp/YangXMZ12 fatcat:uh4ggzczmfevxkvteqsl7n45iy

Parallel hybrid evolutionary algorithms on GPU

The Van Luong, Nouredine Melab, El-Ghazali Talbi
2010 IEEE Congress on Evolutionary Computation  
This paper presents a new methodology to design and implement efficiently and effectively hybrid evolutionary algorithms on GPU accelerators.  ...  The methodology enables efficient mappings of the explored search space onto the GPU memory hierarchy.  ...  Due to high misaligned accesses to global memories (flows and distances in QAP), non-coalescing memory reduces the performance of the GPU implementation.  ... 
doi:10.1109/cec.2010.5586403 dblp:conf/cec/LuongMT10 fatcat:d5f3jph2kza4nc2xsqfqvw2pny

An Efficient Block Cipher Implementation on Many-Core Graphics Processing Units

Sang-Pil Lee, Deok-Ho Kim, Jae-Young Yi, Won-Woo Ro
2012 Journal of Information Processing Systems  
The recent emergence of VLSI technology makes it feasible to fabricate multiple processing cores on a single chip and enables general-purpose computation on a GPU (GPGPU).  ...  This paper presents a study on a high-performance design for a block cipher algorithm implemented on modern many-core graphics processing units (GPUs).  ...  There were no non-coalesced global memory loads/stores. Hence, all of the data was transferred efficiently between the global memory and the shared memory.  ... 
doi:10.3745/jips.2012.8.1.159 fatcat:ui4gee3smzbozm74wkw37go64m

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt
2009 2009 IEEE International Symposium on Performance Analysis of Systems and Software  
mechanisms, and memory request coalescing hardware.  ...  The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism.  ...  Acknowledgments We thank Kevin Skadron, Michael Shebanow, John Kim, Andreas Moshovos, Xi Chen, Johnny Kuan and the anonymous reviewers for their valuable comments on this work.  ... 
doi:10.1109/ispass.2009.4919648 dblp:conf/ispass/BakhodaYFWA09 fatcat:dxsomfd3wzce3mlya5ek2jwcsq