561 Hits in 3.3 sec

A characterization and analysis of PTX kernels

Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili
2009 2009 IEEE International Symposium on Workload Characterization (IISWC)  
While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist  ...  The analysis was performed using a full function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture) - a machine model and low level  ...  ACKNOWLEDGEMENTS The authors gratefully acknowledge the generous support of this work by LogicBlox Inc., IBM Corp., and NVIDIA Corp. both through research grants, fellowships, as well as technical interactions  ... 
doi:10.1109/iiswc.2009.5306801 dblp:conf/iiswc/KerrDY09 fatcat:mz3sbt3drrdnlbo46nqm5gtwmi

A framework for dynamically instrumenting GPU compute applications within GPU Ocelot

Naila Farooqui, Andrew Kerr, Gregory Diamos, S. Yalamanchili, K. Schwan
2011 Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4  
In this paper we present the design and implementation of a dynamic instrumentation infrastructure for PTX programs that procedurally transforms kernels and manages related data structures.  ...  On average, compilation overheads due to instrumentation consisted of 69% of the time needed to parse a kernel module, in the case of the Parboil benchmark suite.  ...  We gratefully acknowledge the insights of Vishakha Gupta and Alexander Merritt for their suggestion of the online credit-based scheduler.  ... 
doi:10.1145/1964179.1964192 dblp:conf/asplos/FarooquiKDYS11 fatcat:lyjk3cwwvzaopcbxwkei2lrqku

Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures

Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, Sudhakar Yalamanchili
2012 2012 IEEE International Symposium on Performance Analysis of Systems & Software  
The paper concludes with a comparative analysis of Lynx with existing GPU profiling tools and a quantitative assessment of Lynx's instrumentation performance, providing insights into optimization opportunities  ...  for running instrumented GPU kernels.  ...  ACKNOWLEDGEMENTS This research was supported by NSF under grants CCF-0905459, OCI-0910735, and IIP-1032032.  ... 
doi:10.1109/ispass.2012.6189206 dblp:conf/ispass/FarooquiKESY12 fatcat:pcfzrlhmorhq7c2i4ppjzfyuy4

Modeling GPU-CPU workloads and systems

Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili
2010 Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units - GPGPU '10  
Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous  ...  metrics that are available before a kernel is executed.  ...  Acknowledgements The authors gratefully acknowledge the generous support of this work by LogicBlox Inc., IBM Corp., and NVIDIA Corp. both through research grants, fellowships, as well as technical interactions  ... 
doi:10.1145/1735688.1735696 dblp:conf/asplos/KerrDY10 fatcat:wnmw2zpc5vf6dhgo6slczg7ila

Characterizing and enhancing global memory data coalescing on GPUs

Naznin Fauzia, Louis-Noel Pouchet, P. Sadayappan
2015 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
We develop a two-stage framework where dynamic analysis is first used to detect and characterize uncoalesced accesses in arbitrary PTX programs.  ...  Experimental results demonstrate the use of the tools on a number of benchmarks from the Rodinia and Polybench suites.  ...  National Science Foundation through awards 0926127, 1321147 and 1440749.  ... 
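doi:10.1109/cgo.2015.7054183 dblp:conf/cgo/FauziaPS15 fatcat:cu4ipgwwbrdjtetkr4vjr5oo6m

Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Gregory Frederick Diamos, Andrew Robert Kerr, Sudhakar Yalamanchili, Nathan Clark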
2010 Proceedings of the 19th international conference on Parallel architectures and compilation techniques - PACT '10  
This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance.  ...  Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs  ...  [14] and extended by the same authors in [14], and 2) a dynamic compiler from PTX to Cell by Diamos et al. [15], and 3) a characterization of the dynamic behavior of PTX workloads by Kerr et al  ... 
doi:10.1145/1854273.1854318 dblp:conf/IEEEpact/DiamosKYC10 fatcat:l5xp67cqnjaxzlwdtk5ceb6loi

Optimal loop unrolling for GPGPU programs

Giridhar Sreenivasa Murthy, Mahesh Ravishankar, Muthu Manikandan Baskaran, P. Sadayappan
2010 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)  
We use these techniques to evaluate the effect of loop unrolling on a range of GPGPU programs and show that we correctly identify the optimal unroll factors, and that these optimized versions run up to  ...  Sadayappan for his advice and guidance throughout the duration of my Masters study. He has been a huge inspiration since my first CSE 621 class.  ...  Description: The PTX Analyzer is a static program analysis tool that consumes the disassembled PTX representation of a CUDA kernel and rebuilds the Control Flow Graph (CFG) [7] and detects loops in the  ... 
doi:10.1109/ipdps.2010.5470423 dblp:conf/ipps/MurthyRBS10 fatcat:i2yoiawuhnbrrepry3667luci4

Caracal: dynamic translation of runtime environments for GPUs

Rodrigo Domínguez, Dana Schaa, David Kaeli
2011 Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4  
Graphics Processing Units (GPU) have become the platform of choice for accelerating a large range of data parallel and task parallel applications.  ...  Given the fact that CUDA C has been on the market for a number of years, a large number of applications have been developed in the HPC community.  ...  The authors would like to thank the members of the Ocelot mailing list, especially Gregory F. Diamos and Andrew R. Kerr, for their helpful discussions and their comments on our work.  ... 
doi:10.1145/1964179.1964186 dblp:conf/asplos/DominguezSK11 fatcat:p3wi6reknbbajdbtcyic7w5iru

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs [article]

Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz
2019 arXiv   pre-print
In this paper, we introduce a very low overhead and portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies  ...  The results in this paper can help architects to have an accurate characterization of the latencies of these GPUs, which will help in modeling the hardware accurately.  ...  We used parallel thread execution (PTX) [21] to perform our analysis. PTX is a pseudo-assembly language used in NVIDIA's CUDA programming environment.  ... 
arXiv:1905.08778v2 fatcat:physx7sjmfh4bjzgb2r5ypni6q

Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications

Nilanjan Goswami, Ramkumar Shankar, Madhura Joshi, Tao Li
2010 IEEE International Symposium on Workload Characterization (IISWC'10)  
We present a diversity analysis of GPU benchmark suites such as Nvidia CUDA SDK, Parboil and Rodinia.  ...  Our results show that with a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction, and Scan of Large Arrays show diverse characteristics in different workload spaces.  ...  ACKNOWLEDGEMENTS This work is supported in part by NSF grants CNS-0834288, CCF-0845721 (CAREER), SRC grant 2008-HJ-1798, and by three IBM Faculty Awards.  ... 
doi:10.1109/iiswc.2010.5649549 dblp:conf/iiswc/GoswamiSJL10 fatcat:jw66a5zr6bbvxo32mmezqi5u64

Warp-aware trace scheduling for GPUs

James A. Jablin, Thomas B. Jablin, Onur Mutlu, Maurice Herlihy
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10× on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12× and reducing instruction serialization and total instructions executed.  ...  [6] propose a static branch divergence analysis.  ... 
doi:10.1145/2628071.2628101 dblp:conf/IEEEpact/JablinJMH14 fatcat:crjqndrorjhiddj5c2hmvmlbom

Kernel-Based Learning for Statistical Signal Processing in Cognitive Radio Networks: Theoretical Foundations, Example Applications, and Future Directions

Guoru Ding, Qihui Wu, Yu-Dong Yao, Jinlong Wang, Yingying Chen
2013 IEEE Signal Processing Magazine  
Her work has involved a combination of research and development of new technologies and real systems.  ...  She is currently an associate professor in the Department of Electrical and Computer Engineering at Stevens Institute of Technology.  ...  Besides SVMs, the most well-known kernel methods include kernel Fisher discriminant analysis (FDA) [19], kernel K-means clustering [20], kernel principal component analysis (PCA) [5], and kernel  ... 
doi:10.1109/msp.2013.2251071 fatcat:gsz5mc6nbjcdzezobmqo5wmx2q

Dynamic compilation of data-parallel kernels for vector processors

Andrew Kerr, Gregory Diamos, S. Yalamanchili
2012 Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12  
This work applies dynamic compilation to explicitly data-parallel kernels and describes a set of program transformations that efficiently compile bulk-synchronous scalar kernels for SIMD functional units  ...  It is agnostic to specific features of ISAs, and performance scalability is expected from 2-wide to arbitrary-width vector units.  ...  their recommendations and feedback.  ... 
doi:10.1145/2259016.2259020 dblp:conf/cgo/KerrDY12 fatcat:7gpks5jk5zhdjkfqhtvcvq34we

Modeling Deep Learning Accelerator Enabled GPUs [article]

Md Aamir Raihan, Negar Goli, Tor Aamodt
2019 arXiv   pre-print
The efficacy of deep learning has resulted in its use in a growing number of applications.  ...  In this paper we study the design of the tensor cores in NVIDIA's Volta and Turing architectures. We further propose an architectural model for the tensor cores in Volta.  ...  COHESA is financed under the National Sciences and Engineering Research Council of Canada (NSERC) Strategic Networks grant number NETGP485577-15.  ... 
arXiv:1811.08309v2 fatcat:sjdjievr55hfjc7vqd6ilo4vjm

High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities

Dennis Abts, John Kim
2011 Synthesis Lectures on Computer Architecture  
Acknowledgments First we would like to thank Mark Hill and Michael Morgan for having invited us to write a synthesis lecture and for their support. Many thanks to reviews from Tor M. Aamodt  ...  PTX Emulation and Trace Analysis GPU Ocelot's PTX emulator executes CUDA kernels at the PTX level and provides the complete architectural state of a GPU for each dynamically executed instruction.  ...  Ocelot's complete implementation of the CUDA Runtime API, rich set of PTX analysis passes, and kernel transformation pass manager offer a powerful platform for developing additional profiling and analysis  ... 
doi:10.2200/s00341ed1v01y201103cac014 fatcat:rjpziqdnezdrnhfiygrg3jdz4m