Improving GPGPU resource utilization through alternative thread block scheduling
2014
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
contention. • This paper proposes an approach that considers both the warp scheduler and the block scheduler to improve efficiency in the GPGPU architecture. ...
Two schedulers exist in a GPGPU: a thread block/CTA scheduler assigns CTAs to cores, and a warp/wavefront scheduler determines which warp is executed. • There has been work on different warp schedulers: cache-conscious wavefront scheduling ...
Conclusion • LCS (lazy CTA scheduling): leverages a greedy warp scheduler to determine the optimal number of thread blocks per core (a minimal greedy scheduler is sketched after this entry). • BCS (block CTA scheduling): exploits inter-CTA locality to improve overall ...
doi:10.1109/hpca.2014.6835937
dblp:conf/hpca/LeeSMKSCR14
fatcat:lu6zbvl57vhavmnnzxg3sqwnqi
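The "greedy warp scheduler" the LCS summary above leans on is commonly a greedy-then-oldest (GTO) policy: keep issuing from the current warp until it stalls, then fall back to the oldest ready warp. The sketch below is a minimal illustration under that assumption; the Warp record and pick_warp routine are hypothetical names, not the paper's implementation.

    #include <cstdint>
    #include <vector>

    struct Warp {
        uint64_t age;    // cycle at which the warp was first scheduled
        bool     ready;  // next instruction has no outstanding hazards
    };

    // Greedy-then-oldest: stay on the current warp while it can issue;
    // otherwise pick the oldest ready warp. Returns -1 if none can issue.
    int pick_warp(const std::vector<Warp>& warps, int current) {
        if (current >= 0 && warps[current].ready)
            return current;                          // greedy part
        int oldest = -1;
        for (int i = 0; i < (int)warps.size(); ++i)
            if (warps[i].ready &&
                (oldest < 0 || warps[i].age < warps[oldest].age))
                oldest = i;                          // oldest-ready fallback
        return oldest;
    }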
Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications
2020
Applied Sciences
To schedule threads in GPGPU, a specialized hardware scheduler allocates thread blocks to the computing units, called SMs (Streaming Multiprocessors), in a round-robin manner (a toy model of this policy is sketched after this entry). ...
We implement our model as a GPGPU scheduling simulator and show that the conventional thread block scheduling provided in GPGPU hardware does not perform well as the workload becomes heavy. ...
Thus, developing simulators is an alternative way of evaluating thread block schedulers. ...
doi:10.3390/app10249121
fatcat:jeg5y2y6yzgzpgydxyyzjm6twe
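A toy model of the round-robin block-to-SM assignment described in the abstract above. The SM count, per-SM block limit, and block count are hypothetical example values; real hardware also checks register and shared-memory budgets before making a block resident.

    #include <cstdio>

    int main() {
        const int kNumSMs = 4, kMaxBlocksPerSM = 2, kNumBlocks = 10;
        int resident[kNumSMs] = {0};   // blocks currently resident on each SM
        int sm = 0;
        for (int block = 0; block < kNumBlocks; ++block) {
            int tried = 0;             // scan at most one full round of SMs
            while (resident[sm] >= kMaxBlocksPerSM && tried++ < kNumSMs)
                sm = (sm + 1) % kNumSMs;
            if (resident[sm] >= kMaxBlocksPerSM) {
                printf("block %d waits until a resident block retires\n", block);
                break;                 // hardware would stall dispatch here
            }
            resident[sm]++;
            printf("block %d -> SM %d\n", block, sm);
            sm = (sm + 1) % kNumSMs;   // round-robin: next block starts at next SM
        }
        return 0;
    }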
Exploring the limits of GPGPU scheduling in control flow bound applications
2012
ACM Transactions on Architecture and Code Optimization (TACO)
We implement an ideal hierarchical warp scheduling mechanism we term ODGS (Oracle Dynamic Global Scheduling) designed to maximize machine utilization via global warp reconstruction. ...
We show both analytically and by simulations of various benchmarks that local thread scheduling has inherent limitations when dealing with applications that have high rate of branch divergence. ...
After a block is scheduled, its resources are freed only when all the threads within the block have finished. ...
doi:10.1145/2086696.2086708
fatcat:53gaayk7zrgbfbsj3sgphauade
CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads
2014
Proceedings of the 23rd International Conference on Parallel Architectures and Compilation - PACT '14
policy can improve GPGPU applications' performance by 17% on average. ...
The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for ...
Since a thread block cannot be scheduled to execute until all warps in the previous thread block have finished executing, this results in sub-optimal computing resource utilization, as the warps which ...
doi:10.1145/2628071.2628107
dblp:conf/IEEEpact/LeeW14
fatcat:nae6dcoe4fe45b3we6gih6ttwq
Efficient utilization of GPGPU cache hierarchy
2015
Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015
In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. ...
However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data sharing, which may degrade performance instead. ...
Special thanks go to Ahmed ElTantawy for his assistance with the GPGPU-Sim tool. ...
doi:10.1145/2716282.2716291
dblp:conf/ppopp/KhairyZW15
fatcat:l5jwwrzbyzaqtmljn7h7yssueq
Improving GPGPU concurrency with elastic kernels
2013
SIGPLAN notices
Current GPUs therefore allow concurrent execution of kernels to improve utilization. ...
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. ...
We thank our anonymous reviewers and our shepherd, Rodric Rabbah, for their feedback which has significantly improved this work. ...
doi:10.1145/2499368.2451160
fatcat:ljkrkgicvnavbeyugydbpqhpf4
Improving GPGPU concurrency with elastic kernels
2013
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems - ASPLOS '13
Current GPUs therefore allow concurrent execution of kernels to improve utilization. ...
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. ...
We thank our anonymous reviewers and our shepherd, Rodric Rabbah, for their feedback which has significantly improved this work. ...
doi:10.1145/2451116.2451160
dblp:conf/asplos/PaiTG13
fatcat:glstzpry65goxjx4pdqgnnqdxm
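The two records above describe the same ASPLOS '13 paper. The concurrency mechanism its abstract refers to is exposed through CUDA streams: kernels launched into different streams may overlap when neither saturates the GPU on its own. A minimal sketch with placeholder kernels; the small grids are a deliberate assumption that leaves SMs idle so the launches can overlap.

    #include <cuda_runtime.h>

    __global__ void kernel_a(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }
    __global__ void kernel_b(float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 16;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        // Launches in distinct streams have no ordering between them,
        // so the hardware is free to run them concurrently.
        kernel_a<<<4, 256, 0, s1>>>(x, n);
        kernel_b<<<4, 256, 0, s2>>>(y, n);
        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }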
Covert Channels on GPGPUs
2017
IEEE computer architecture letters
In either mode, we identify the shared resources that may be used to create contention. ...
We reverse engineer the block placement algorithm to understand co-residency of blocks from different applications on the same Streaming Multiprocessor (SM) core, or on different SMs concurrently. ...
COLOCATING APPLICATIONS ON GPGPU: Our goal is to create covert channels through contention on shared hardware resources in the GPGPU. ...
doi:10.1109/lca.2016.2590549
fatcat:wmfwf7sswvc4xalhpwsmymzsnq
GPUSync: A Framework for Real-Time GPU Management
2013
2013 IEEE 34th Real-Time Systems Symposium
Specifically, it can be applied under either static- or dynamic-priority CPU scheduling; can allocate CPUs/GPUs on a partitioned, clustered, or global basis; provides flexible mechanisms for allocating GPUs ...
provides migration cost predictors that determine when migrations can be effective; enables a single GPU's different engines to be accessed in parallel; properly supports GPU-related interrupt and worker threads ...
GPUSync improves upon this earlier work by also properly managing user-space helper threads, which are utilized in the latest closed-source GPGPU runtimes. ...
doi:10.1109/rtss.2013.12
dblp:conf/rtss/ElliottWA13
fatcat:6cb7ch54prbo7a3xkrekmivx2q
Holistic Management of the GPGPU Memory Hierarchy to Manage Warp-level Latency Tolerance
[article]
2018
arXiv pre-print
In a modern GPU architecture, all threads within a warp execute the same instruction in lockstep. ...
For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. ...
how much of a particular memory resource to allocate to each thread block. ...
arXiv:1804.11038v1
fatcat:tsvp3wmj2rcn3kesyzfsbndb2q
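The lockstep execution and memory divergence described above show up in any gather with data-dependent indices: all 32 lanes of a warp issue the load together, but the per-lane addresses may fall in up to 32 distinct cache lines, so some lanes are serviced early while others wait. A generic CUDA sketch; the kernel and its names are illustrative, not the paper's.

    #include <cuda_runtime.h>

    __global__ void gather(const float* __restrict__ src,
                           const int*   __restrict__ idx,
                           float* dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // idx[i] is arbitrary: one warp's 32 loads may touch 32 different
            // cache lines (divergent) instead of one line (coalesced).
            dst[i] = src[idx[i]];
    }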
Chimera: Collaborative Preemption for Multitasking on a GPU Architecture
2015
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '15
For multi-programmed workloads, Chimera can improve the average normalized turnaround time by 5.5x, and system throughput by 12.2%. ...
Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context in GPUs. ...
When a kernel is launched, each thread block is scheduled to one of the SMs. Depending on the resource constraints, the number of thread blocks that can run simultaneously on an SM may vary. ...
doi:10.1145/2694344.2694346
dblp:conf/asplos/ParkPM15
fatcat:kaqig6ktkjhfrjxgbaavpiy33i
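The resource constraints mentioned above can be made concrete: the number of blocks resident on one SM is bounded by whichever resource is exhausted first. A sketch with hypothetical per-SM limits, not those of any particular GPU generation; the CUDA runtime offers a real query of this kind, cudaOccupancyMaxActiveBlocksPerMultiprocessor.

    #include <algorithm>

    int max_resident_blocks(int threads_per_block, int regs_per_thread,
                            int smem_per_block) {
        const int kMaxThreads = 2048;      // threads per SM (example value)
        const int kMaxRegs    = 65536;     // 32-bit registers per SM (example)
        const int kMaxSmem    = 96 * 1024; // shared memory per SM, bytes (example)
        const int kMaxBlocks  = 32;        // hardware block-slot limit (example)
        int by_threads = kMaxThreads / threads_per_block;
        int by_regs    = kMaxRegs / (regs_per_thread * threads_per_block);
        int by_smem    = smem_per_block ? kMaxSmem / smem_per_block : kMaxBlocks;
        // The tightest of the four limits decides how many blocks fit.
        return std::min({by_threads, by_regs, by_smem, kMaxBlocks});
    }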
Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs
2010
2010 39th International Conference on Parallel Processing
However, the presence of cross-iteration data dependences in DOACR loops poses an obstacle to execute their computations concurrently using a massive number of fine-grained threads. ...
Our main finding is that certain DOACR loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously ...
Threads in the same thread block can cooperate by barrier-synchronizing their memory accesses and can share data through the shared memory. ...
doi:10.1109/icpp.2010.13
dblp:conf/icpp/DiWZWX10
fatcat:udvg2zldevb5lm6r3yafb2km5q
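The intra-block cooperation described above, barrier-synchronized data sharing through shared memory, looks like this in CUDA. The stencil is a placeholder and assumes a launch with 256 threads per block.

    #include <cuda_runtime.h>

    __global__ void neighbor_sum(const float* in, float* out, int n) {
        __shared__ float tile[256];                  // one element per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage data in shared memory
        __syncthreads();                             // barrier: tile fully written
        if (i < n) {
            float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
            out[i] = tile[threadIdx.x] + left;       // read a neighbor's element
        }
    }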
Cooperative GPGPU Scheduling for Consolidating Server Workloads
2018
IEICE transactions on information and systems
The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps on the basis of its scheduling policy and isolates resources among them. ...
Making GPUs a time-multiplexing resource is a key to consolidating GPGPU applications (apps) in multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. ...
Existing GPU resource managers, including GPU command-based schedulers [24] - [26] , novel GPU kernel launchers [27] , [28] , and thread block schedulers [29] , [30] , fail to schedule GPU eaters ...
doi:10.1587/transinf.2018edp7027
fatcat:gmgxosap7nhxjanmgcv3uniw3y
Cortical architectures on a GPGPU
2010
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units - GPGPU '10
We also consider two inefficiencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. ...
The GPGPU is a readily-available architecture that fits well with the parallel cortical architecture inspired by the basic building blocks of the human brain. ...
Section 6 describes a method to pipeline the training stages of the cortical architecture to improve resource utilization on the GPGPU and presents some performance results. ...
doi:10.1145/1735688.1735693
dblp:conf/asplos/NereL10
fatcat:rsyqfc54rngphhmz2v7ng7j6iq
FlexGrip: A soft GPGPU for FPGAs
2013
2013 International Conference on Field-Programmable Technology (FPT)
In this paper, we describe the implementation of FlexGrip, a soft GPGPU architecture which has been optimized for FPGA implementation. ...
This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU without hardware recompilation. ...
The block scheduler is responsible for scheduling thread blocks in a round-robin fashion. ...
doi:10.1109/fpt.2013.6718358
dblp:conf/fpt/AndrycMT13
fatcat:7ey67anaezbj7p7dgz2qtlnzty
Showing results 1 — 15 out of 1,017 results