1,452 Hits in 4.9 sec

Exploring Memory Persistency Models for GPUs [article]

Zhen Lin, Mohammad Alshboul, Yan Solihin, Huiyang Zhou
2019 arXiv   pre-print
We design a pragma-based compiler scheme to express persistency models for GPUs. We identify that the thread hierarchy in GPUs offers intuitive scopes to form epochs and durable transactions.  ...  Considering the importance of GPUs as a dominant accelerator for high performance computing, we investigate persistency models for GPUs.  ...  We also found that l2wb is profitable in cases where it is difficult/expensive to re-generate addresses required for clwb. We also use the membar instruction as a persist barrier between epochs.  ... 
arXiv:1904.12661v1 fatcat:2ofeqybxabb25e2u327qv677fu

Portable inter-workgroup barrier synchronisation for GPUs

Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, Zvonimir Rakamarić
2016 Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications - OOPSLA 2016  
This work was supported in part by an equipment grant from GCHQ, a gift from Intel Corporation, an EPSRC Impact Acceleration Award, the Royal Academy of Engineering, the Lloyds Register Foundation, NSF  ...  this work, and the OOPSLA reviewers (paper and artifact) for their thorough evaluations and feedback which greatly improved this paper.  ...  OpenCL An OpenCL program comprises host code, executed on the CPU, and device code, executed on a device (in our case a GPU).  ... 
doi:10.1145/2983990.2984032 dblp:conf/oopsla/SorensenDBGR16 fatcat:wggfexzmkbc5xfhxzje7g6tvqq
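The core of the inter-workgroup barrier described above can be illustrated with a sense-reversing counter barrier. The following is a hypothetical CPU-thread sketch, with Python threads standing in for workgroups; the paper's GPU version spins on atomic operations with OpenCL memory ordering, whereas this sketch waits on a condition variable:

```python
import threading

class GlobalBarrier:
    """Sense-reversing counter barrier: each arrival increments a shared
    counter; the last arrival resets it and flips the shared sense flag,
    releasing everyone else. Reusable across phases."""
    def __init__(self, n):
        self.n = n              # number of participants ("workgroups")
        self.count = 0
        self.sense = False
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            my_sense = not self.sense   # the sense this phase will flip to
            self.count += 1
            if self.count == self.n:    # last arrival: release the phase
                self.count = 0
                self.sense = my_sense
                self.cv.notify_all()
            else:                       # otherwise wait for the flip
                while self.sense != my_sense:
                    self.cv.wait()
```

The sense flag is what makes the barrier safely reusable: a fast thread re-entering `wait()` for the next phase cannot be confused with a slow thread still leaving the previous one.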

Portable inter-workgroup barrier synchronisation for GPUs

Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, Zvonimir Rakamarić
2016 SIGPLAN notices  
This work was supported in part by an equipment grant from GCHQ, a gift from Intel Corporation, an EPSRC Impact Acceleration Award, the Royal Academy of Engineering, the Lloyds Register Foundation, NSF  ...  this work, and the OOPSLA reviewers (paper and artifact) for their thorough evaluations and feedback which greatly improved this paper.  ...  OpenCL An OpenCL program comprises host code, executed on the CPU, and device code, executed on a device (in our case a GPU).  ... 
doi:10.1145/3022671.2984032 fatcat:3abc6txrufao3m66pcprvhzl74

Active Data Structures on GPGPUs [chapter]

John T. O'Donnell, Cordelia Hall, Stuart Monro
2014 Lecture Notes in Computer Science  
General purpose GPUs were designed to support regular graphics algorithms, but their intermediate level of granularity makes them potentially viable also for active data structures.  ...  Active data structures support operations that may affect a large number of elements of an aggregate data structure.  ...  The GPU system provides a primitive barrier synchronisation for threads within a block, but not for threads in different blocks.  ... 
doi:10.1007/978-3-642-54420-0_85 fatcat:n4eon7pxtjhvbnprlbsshpnkfu

The Hitchhiker's Guide to Cross-Platform OpenCL Application Development

Tyler Sorensen, Alastair F. Donaldson
2016 Proceedings of the 4th International Workshop on OpenCL - IWOCL '16  
To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted to execute across a variety of GPU platforms, via OpenCL  ...  improve the state of OpenCL portability; we conclude with a discussion of these.  ...  On reaching a barrier a thread waits until all threads in its workgroup have reached the barrier. Barriers can be used for deterministic communication.  ... 
doi:10.1145/2909437.2909440 dblp:conf/iwocl/SorensenD16 fatcat:r7wn32p7kbbkjb3si6u7dnfn7m
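The workgroup barrier semantics quoted in the snippet (every work-item waits until all items in its workgroup arrive, after which pre-barrier writes are visible to all) can be mimicked with CPU threads; this hypothetical sketch uses `threading.Barrier` in place of OpenCL's `barrier()`, with a shared list standing in for `__local` memory:

```python
import threading

WG_SIZE = 4
scratch = [0] * WG_SIZE               # stands in for __local memory
barrier = threading.Barrier(WG_SIZE)  # all work-items must arrive

def work_item(lid, out):
    scratch[lid] = lid * lid          # phase 1: each item writes its slot
    barrier.wait()                    # deterministic communication point
    out[lid] = sum(scratch)          # phase 2: safely read peers' writes
```

Without the barrier, a work-item could read `scratch` slots its peers have not yet written; with it, the reduction in phase 2 is deterministic.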

Cortex: A Compiler for Recursive Deep Learning Models [article]

Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry
2021 arXiv   pre-print
In this paper, we present Cortex, a compiler-based approach to generate highly-efficient code for recursive models for low latency inference.  ...  This approach often leaves significant performance on the table, especially for the case of recursive deep learning models.  ...  In this case, after unrolling, the cost of a barrier cannot be amortized across all nodes in a batch, as illustrated in Fig. 11 .  ... 
arXiv:2011.01383v2 fatcat:p4vtyqjenjeolfyfn527bbv2oe

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs [chapter]

John A. Stratton, Sam S. Stone, Wen-mei W. Hwu
2008 Lecture Notes in Computer Science  
This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs.  ...  CUDA is a data parallel programming model that supports several key abstractions -thread blocks, hierarchical memory and barrier synchronization -for writing applications.  ...  Acknowledgements We would like to thank Micheal Garland, John Owens, Chris Rodrigues, Vinod Grover and NVIDIA corporation for their feedback and support.  ... 
doi:10.1007/978-3-540-89740-8_2 fatcat:wdg3yrut2jcxzpxunxaxz2zvky
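The MCUDA-style serialization of a thread block on a CPU can be sketched as loop fission at the barrier: the kernel body is split at `__syncthreads()` into two sequential loops over thread IDs, so every "thread" finishes the pre-barrier code before any runs the post-barrier code. A hypothetical illustration (not the paper's actual implementation, which also handles per-thread live state):

```python
BLOCK = 4  # assumed block size for this sketch

def kernel_serialized(data):
    """A CUDA-like kernel with one barrier, serialized for one block."""
    tmp = [0] * BLOCK
    # Loop 1: everything before __syncthreads()
    for tid in range(BLOCK):
        tmp[tid] = data[tid] * 2
    # --- the barrier becomes the boundary between the two loops ---
    # Loop 2: everything after __syncthreads()
    out = [0] * BLOCK
    for tid in range(BLOCK):
        out[tid] = tmp[(tid + 1) % BLOCK]  # reads a neighbour's value
    return out
```

The neighbour read in loop 2 is exactly the access pattern that would race without the barrier; the fission guarantees all of `tmp` is written first.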

Persistent Kernels for Iterative Memory-bound GPU Applications [article]

Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Satoshi Matsuoka
2022 arXiv   pre-print
In this scheme the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization.  ...  We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS).  ...  We propose a persistent kernel execution scheme for iterative GPU applications. We enhance performance by moving the time loop to the kernel and caching the intermediate output of each time step.  ... 
arXiv:2204.02064v2 fatcat:campsz22iff5jfdmo7nrth7xje
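The persistent-kernel scheme above (time loop inside the kernel, device-wide barrier between iterations instead of kernel relaunches) can be sketched with CPU threads standing in for thread blocks; this is a hypothetical analogy, not the paper's CUDA implementation:

```python
import threading

N_WORKERS, STEPS = 4, 3
step_barrier = threading.Barrier(N_WORKERS)  # stands in for a grid-wide barrier

def persistent_worker(wid, grid):
    """One persistent 'block': loops over time steps inside the kernel."""
    for _ in range(STEPS):      # the time loop, moved inside the kernel
        grid[wid] += 1          # one time step's work on this slice
        step_barrier.wait()     # replaces the per-step kernel relaunch
```

Each barrier crossing marks an iteration boundary, so intermediate data can stay resident (the paper caches it on-chip) rather than being flushed and reloaded around every launch.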

Cooperative kernels: GPU multitasking for blocking algorithms

Tyler Sorensen, Hugues Evrard, Alastair F. Donaldson
2017 Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2017  
We describe a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining  ...  Current approaches avoid this issue by exploiting scheduling quirks of today's GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks).  ...  We thank the FSE reviewers for their thorough evaluations and feedback. This work is supported in part by EPSRC Fellowship EP/N026314, and a gift from Intel Corporation.  ... 
doi:10.1145/3106237.3106265 dblp:conf/sigsoft/SorensenED17 fatcat:2vyp6qhwkvbunliy3ntykar57a

Verification of producer-consumer synchronization in GPU programs

Rahul Sharma, Michael Bauer, Alex Aiken
2015 Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2015  
In this work we present the first formal operational semantics for named barriers and define what it means for a warp-specialized kernel to be correct.  ...  We also present WEFT, a verification tool for checking warp-specialized code. Using WEFT, we discover several non-trivial bugs in production warp-specialized kernels.  ...  Acknowledgements We wish to thank Michael Garland, Vinod Grover, Divya Gupta, Manolis Papadakis, Sean Treichler, and the anonymous reviewers for their valuable comments and feedback.  ... 
doi:10.1145/2737924.2737962 dblp:conf/pldi/SharmaBA15 fatcat:r53doygmrrexdketgm6juge44y
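The producer-consumer pattern that named barriers enable in warp-specialized kernels can be sketched with two CPU threads: a "named" barrier shared only by the producer and consumer (not the whole block) orders the buffer write before the read. A hypothetical analogy of the handoff WEFT verifies:

```python
import threading

buf = [None]                            # staging buffer being handed off
named_barrier_0 = threading.Barrier(2)  # joined only by these two "warps"

def producer():
    buf[0] = 42                 # fill the staging buffer
    named_barrier_0.wait()      # arrive: data is ready

def consumer(out):
    named_barrier_0.wait()      # wait for the producer to arrive
    out.append(buf[0])          # safe to read after the barrier
```

The bugs WEFT targets are exactly mismatches in this protocol, e.g. a consumer synchronizing on the wrong barrier name or a missing arrival that lets the read race the write.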

Verification of producer-consumer synchronization in GPU programs

Rahul Sharma, Michael Bauer, Alex Aiken
2015 SIGPLAN notices  
In this work we present the first formal operational semantics for named barriers and define what it means for a warp-specialized kernel to be correct.  ...  We also present WEFT, a verification tool for checking warp-specialized code. Using WEFT, we discover several non-trivial bugs in production warp-specialized kernels.  ...  Acknowledgements We wish to thank Michael Garland, Vinod Grover, Divya Gupta, Manolis Papadakis, Sean Treichler, and the anonymous reviewers for their valuable comments and feedback.  ... 
doi:10.1145/2813885.2737962 fatcat:pqgc4ciworchfd7a3njooiienu

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Islam Harb, Wu-Chun Feng
2016 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)  
The lack of support in GPUs for explicit synchronization between the streaming multiprocessors (SMs) adversely impacts the ability of GPUs to efficiently perform inter-block communication  ...  The quantification, in turn, provides insight as to when to use each of the aforementioned synchronization mechanisms in a target application.  ...  As such, explicit GPU global synchronization is considered out of the scope of this paper.  ... 
doi:10.1109/mascots.2016.58 dblp:conf/mascots/HarbF16 fatcat:s5q4kd2arvcpziwovbr67k4rvi

Free launch

Guoyang Chen, Xipeng Shen
2015 Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48  
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications.  ...  There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel  ...  The experiments benefit from the GPU devices donated by NVIDIA.  ... 
doi:10.1145/2830772.2830818 dblp:conf/micro/ChenS15 fatcat:4kc7oxcd4bclxf3pahkjnrfax4
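The software-managed-worklist route to dynamic parallelism mentioned above can be sketched with persistent workers that pop tasks and may push sub-tasks, instead of launching sub-kernels. A hypothetical CPU-thread sketch (the GPU versions surveyed by the paper use in-memory queues polled by persistent blocks):

```python
import threading, queue

work = queue.Queue()      # the software-managed worklist
results = []
res_lock = threading.Lock()

def worker():
    """Persistent worker: processes tasks until it sees the sentinel."""
    while True:
        task = work.get()
        if task is None:          # sentinel: shut down
            work.task_done()
            return
        with res_lock:
            results.append(task)  # "process" the task
        if task > 1:
            work.put(task - 1)    # dynamically spawned sub-task
        work.task_done()
```

`Queue.join()` on the main thread waits until every task, including dynamically added ones, is marked done, which is the termination-detection problem the hardware subkernel approach sidesteps.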

Heterogeneous-race-free memory models

Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. Reinhardt, David A. Wood
2014 Proceedings of the 19th international conference on Architectural support for programming languages and operating systems - ASPLOS '14  
(e.g., integrated CPUs and GPUs) now support a unified, shared memory address space for all components.  ...  We quantitatively show that HRF-indirect encourages forward-looking programs with irregular parallelism by showing up to a 10% performance increase in a task runtime for GPUs.  ...  We also thank Marc Orr and Shuai Che for their participation in many constructive discussions.  ... 
doi:10.1145/2541940.2541981 dblp:conf/asplos/HowerHBGHRW14 fatcat:iehbe3fbrff33erb6qfxclgi5i

Efficient Synchronization Primitives for GPUs [article]

Jeff A. Stuart, John D. Owens
2011 arXiv   pre-print
In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU.  ...  We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common  ...  Their success in designing and optimizing a fast GPU barrier inspires our work here. Like barriers, CPU mutex algorithms tend to be ill-suited for the GPU.  ... 
arXiv:1110.4623v1 fatcat:fkdbd7kgcneibbmjohk6mmgh4y