1,017 Hits in 4.1 sec

Improving GPGPU resource utilization through alternative thread block scheduling

Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, Soojung Ryu
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)  
contention. • This paper proposes an approach that considers both the warp scheduler and the block scheduler to improve efficiency in GPGPU architectures.  ...  a GPGPU: -Thread block/CTA scheduler: assign CTAs to cores -Warp/wavefront scheduler: determine which warp is executed • There has been work on different warp schedulers: cache-conscious wavefront scheduling  ...  Conclusion • LCS (lazy CTA scheduling): leverage a greedy warp scheduler to determine the optimal number of thread blocks per core • BCS (block CTA scheduling): exploit inter-CTA locality to improve overall  ... 
doi:10.1109/hpca.2014.6835937 dblp:conf/hpca/LeeSMKSCR14 fatcat:lu6zbvl57vhavmnnzxg3sqwnqi

Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications

KyungWoon Cho, Hyokyung Bahn
2020 Applied Sciences  
To schedule threads on a GPGPU, a specialized hardware scheduler allocates thread blocks to the computing units called SMs (Streaming Multiprocessors) in a round-robin manner.  ...  We implement our model as a GPGPU scheduling simulator and show that the conventional thread block scheduling provided in GPGPU hardware does not perform well as the workload becomes heavy.  ...  Thus, developing simulators is an alternative way for evaluating thread block schedulers.  ... 
doi:10.3390/app10249121 fatcat:jeg5y2y6yzgzpgydxyyzjm6twe
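The round-robin block-to-SM assignment described in this entry can be sketched as a small simulation. This is an illustrative model only; the function name and data layout are not taken from the paper's simulator, and a real hardware scheduler also accounts for per-SM resource availability before dispatching a block.

```python
def schedule_round_robin(num_blocks, num_sms):
    """Assign thread block IDs to SMs in round-robin order,
    mimicking the conventional hardware block scheduler."""
    assignment = {sm: [] for sm in range(num_sms)}
    for block in range(num_blocks):
        # Block i goes to SM (i mod num_sms), regardless of load.
        assignment[block % num_sms].append(block)
    return assignment
```

Because the policy ignores per-block execution time, long-running blocks can pile up on one SM while others sit idle, which is the imbalance the paper's simulator is built to expose.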

Exploring the limits of GPGPU scheduling in control flow bound applications

Roman Malits, Evgeny Bolotin, Avinoam Kolodny, Avi Mendelson
2012 ACM Transactions on Architecture and Code Optimization (TACO)  
We implement an ideal hierarchical warp scheduling mechanism we term ODGS (Oracle Dynamic Global Scheduling) designed to maximize machine utilization via global warp reconstruction.  ...  We show both analytically and by simulations of various benchmarks that local thread scheduling has inherent limitations when dealing with applications that have a high rate of branch divergence.  ...  After a block is scheduled, its resources are freed only when all the threads within the block have finished.  ... 
doi:10.1145/2086696.2086708 fatcat:53gaayk7zrgbfbsj3sgphauade


Shin-Ying Lee, Carole-Jean Wu
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
policy can improve GPGPU applications' performance by 17% on average.  ...  The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for  ...  Since a thread block cannot be scheduled to execute until all warps in the previous thread block have finished executing, this results in sub-optimal computing resource utilization, as the warps which  ... 
doi:10.1145/2628071.2628107 dblp:conf/IEEEpact/LeeW14 fatcat:nae6dcoe4fe45b3we6gih6ttwq

Efficient utilization of GPGPU cache hierarchy

Mahmoud Khairy, Mohamed Zahran, Amr G. Wassal
2015 Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015  
In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches.  ...  However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data-sharing which may degrade the performance instead.  ...  Special thanks go to Ahmed ElTantawy for his assistance with GPGPU-sim tool.  ... 
doi:10.1145/2716282.2716291 dblp:conf/ppopp/KhairyZW15 fatcat:l5jwwrzbyzaqtmljn7h7yssueq

Improving GPGPU concurrency with elastic kernels

Sreepathi Pai, Matthew J. Thazhuthaveetil, R. Govindarajan
2013 SIGPLAN notices  
Current GPUs therefore allow concurrent execution of kernels to improve utilization.  ...  Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources.  ...  We thank our anonymous reviewers and our shepherd, Rodric Rabbah, for their feedback which has significantly improved this work.  ... 
doi:10.1145/2499368.2451160 fatcat:ljkrkgicvnavbeyugydbpqhpf4

Improving GPGPU concurrency with elastic kernels

Sreepathi Pai, Matthew J. Thazhuthaveetil, R. Govindarajan
2013 Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems - ASPLOS '13  
Current GPUs therefore allow concurrent execution of kernels to improve utilization.  ...  Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources.  ...  We thank our anonymous reviewers and our shepherd, Rodric Rabbah, for their feedback which has significantly improved this work.  ... 
doi:10.1145/2451116.2451160 dblp:conf/asplos/PaiTG13 fatcat:glstzpry65goxjx4pdqgnnqdxm

Covert Channels on GPGPUs

Hoda Naghibijouybari, Nael Abu-Ghazaleh
2017 IEEE computer architecture letters  
In either mode, we identify the shared resources that may be used to create contention.  ...  We reverse engineer the block placement algorithm to understand co-residency of blocks from different applications on the same Streaming Multiprocessor (SM) core, or on different SMs concurrently.  ...  COLOCATING APPLICATIONS ON GPGPU Our goal is to create covert channels through contention on shared hardware resources in the GPGPU.  ... 
doi:10.1109/lca.2016.2590549 fatcat:wmfwf7sswvc4xalhpwsmymzsnq

GPUSync: A Framework for Real-Time GPU Management

Glenn A. Elliott, Bryan C. Ward, James H. Anderson
2013 IEEE 34th Real-Time Systems Symposium  
Specifically, it can be applied under either static- or dynamic-priority CPU scheduling; can allocate CPUs/GPUs on a partitioned, clustered, or global basis; provides flexible mechanisms for allocating GPUs  ...  provides migration cost predictors that determine when migrations can be effective; enables a single GPU's different engines to be accessed in parallel; properly supports GPU-related interrupt and worker threads  ...  GPUSync improves upon this earlier work by also properly managing user-space helper threads, which are utilized in the latest closed-source GPGPU runtimes.  ... 
doi:10.1109/rtss.2013.12 dblp:conf/rtss/ElliottWA13 fatcat:6cb7ch54prbo7a3xkrekmivx2q

Holistic Management of the GPGPU Memory Hierarchy to Manage Warp-level Latency Tolerance [article]

Rachata Ausavarungnirun, Saugata Ghose, Onur Kayıran, Gabriel H. Loh, Chita R. Das, Mahmut T. Kandemir, Onur Mutlu
2018 arXiv   pre-print
In a modern GPU architecture, all threads within a warp execute the same instruction in lockstep.  ...  For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies.  ...  how much of a particular memory resource to allocate to each thread block.  ... 
arXiv:1804.11038v1 fatcat:tsvp3wmj2rcn3kesyzfsbndb2q


Jason Jong Kyu Park, Yongjun Park, Scott Mahlke
2015 Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '15  
For multi-programmed workloads, Chimera can improve the average normalized turnaround time by 5.5x, and system throughput by 12.2%.  ...  Preemptive multitasking on CPUs has been primarily supported through context switching. However, the same preemption strategy incurs substantial overhead due to the large context in GPUs.  ...  When a kernel is launched, each thread block is scheduled to one of the SMs. Depending on the resource constraints, the number of thread blocks that can run simultaneously on an SM may vary.  ... 
doi:10.1145/2694344.2694346 dblp:conf/asplos/ParkPM15 fatcat:kaqig6ktkjhfrjxgbaavpiy33i
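The snippet above notes that the number of thread blocks resident on an SM varies with resource constraints. A minimal sketch of that occupancy calculation follows; the hardware limits used as defaults are illustrative round numbers (roughly Kepler/Maxwell-class), not values from the paper, and real GPUs impose additional granularity rules on register and shared-memory allocation.

```python
def blocks_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                  regs_per_sm=65536, smem_per_sm=49152,
                  max_threads_per_sm=2048, max_blocks_per_sm=32):
    """Estimate how many thread blocks can be resident on one SM:
    the tightest of the register, shared-memory, thread-count,
    and block-count limits wins."""
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = (smem_per_sm // smem_per_block) if smem_per_block else max_blocks_per_sm
    by_threads = max_threads_per_sm // threads_per_block
    return min(max_blocks_per_sm, by_regs, by_smem, by_threads)
```

For example, a kernel using 32 registers per thread, 256 threads per block, and 8 KiB of shared memory per block is shared-memory limited here: 6 blocks fit per SM.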

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs

Peng Di, Qing Wan, Xuemeng Zhang, Hui Wu, Jingling Xue
2010 39th International Conference on Parallel Processing  
However, the presence of cross-iteration data dependences in DOACR loops poses an obstacle to executing their computations concurrently using a massive number of fine-grained threads.  ...  Our main finding is that certain DOACR loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously  ...  Threads in the same thread block can cooperate by barrier-synchronizing their memory accesses and can share data through the shared memory.  ... 
doi:10.1109/icpp.2010.13 dblp:conf/icpp/DiWZWX10 fatcat:udvg2zldevb5lm6r3yafb2km5q
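The intra-block cooperation pattern the snippet describes (threads share data, barrier-synchronize, then proceed) can be mimicked on the CPU with OS threads. This is only a toy analogue of the CUDA idiom, where the shared buffer corresponds to shared memory and the barrier to `__syncthreads()`; names here are illustrative.

```python
import threading

def block_reduce(values):
    """Toy analogue of intra-block cooperation: each 'thread' writes
    its result to a shared buffer, all threads barrier-synchronize,
    then thread 0 reduces the buffer (here: sum of doubled inputs)."""
    n = len(values)
    shared = [0] * n                  # stands in for CUDA shared memory
    barrier = threading.Barrier(n)    # stands in for __syncthreads()
    result = []

    def worker(tid):
        shared[tid] = values[tid] * 2  # each thread computes its element
        barrier.wait()                 # no thread proceeds until all have written
        if tid == 0:
            result.append(sum(shared))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

Without the barrier, thread 0 could read the buffer before its peers had written, which is exactly the data race `__syncthreads()` prevents inside a CUDA thread block.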

Cooperative GPGPU Scheduling for Consolidating Server Workloads

Yusuke SUZUKI, Hiroshi YAMADA, Shinpei KATO, Kenji KONO
2018 IEICE transactions on information and systems  
The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps on the basis of its scheduling policy and isolates resources among them.  ...  Making GPUs a time-multiplexing resource is a key to consolidating GPGPU applications (apps) in multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation.  ...  Existing GPU resource managers, including GPU command-based schedulers [24] - [26] , novel GPU kernel launchers [27] , [28] , and thread block schedulers [29] , [30] , fail to schedule GPU eaters  ... 
doi:10.1587/transinf.2018edp7027 fatcat:gmgxosap7nhxjanmgcv3uniw3y

Cortical architectures on a GPGPU

Andrew Nere, Mikko Lipasti
2010 Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units - GPGPU '10  
We also consider two inefficiencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources.  ...  The GPGPU is a readily-available architecture that fits well with the parallel cortical architecture inspired by the basic building blocks of the human brain.  ...  Section 6 describes a method to pipeline the training stages of the cortical architecture to improve resource utilization on the GPGPU and presents some performance results.  ... 
doi:10.1145/1735688.1735693 dblp:conf/asplos/NereL10 fatcat:rsyqfc54rngphhmz2v7ng7j6iq

FlexGrip: A soft GPGPU for FPGAs

Kevin Andryc, Murtaza Merchant, Russell Tessier
2013 International Conference on Field-Programmable Technology (FPT)  
In this paper, we describe the implementation of FlexGrip, a soft GPGPU architecture which has been optimized for FPGA implementation.  ...  This architecture supports direct CUDA compilation to a binary which is executable on the FPGAbased GPGPU without hardware recompilation.  ...  The block scheduler is responsible for scheduling thread blocks in a round-robin fashion.  ... 
doi:10.1109/fpt.2013.6718358 dblp:conf/fpt/AndrycMT13 fatcat:7ey67anaezbj7p7dgz2qtlnzty