A Case for a Flexible Scalar Unit in SIMT Architecture

Yi Yang, Ping Xiang, Michael Mantor, Norman Rubin, Lisa Hsu, Qunfeng Dong, Huiyang Zhou
2014 2014 IEEE 28th International Parallel and Distributed Processing Symposium  
The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing.  ...  To overcome this inefficiency, the AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit.  ...  ACKNOWLEDGEMENTS We thank the anonymous reviewers for their insightful comments to improve our paper. This work is supported by an NSF grant CCF-1216569 and a NSF CAREER award CCF-0968667.  ... 
doi:10.1109/ipdps.2014.21 dblp:conf/ipps/YangXMRHDZ14 fatcat:7wpm74uwzbfmhew23viaghu2qi

Convergence and scalarization for data-parallel architectures

Yunsup Lee, R. Krashinsky, V. Grover, S. W. Keckler, K. Asanovic
2013 Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess  ...  Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code.  ...  Marathe for valuable discussions on the convergence analysis algorithm.  ... 
doi:10.1109/cgo.2013.6494995 dblp:conf/cgo/LeeKGKA13 fatcat:kdtdw6xtzbejpfwuxkdsu7vrim

Characterizing scalar opportunities in GPGPU applications

Zhongliang Chen, David Kaeli, Norman Rubin
2013 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)  
We then evaluate the impact of scalar units on a heterogeneous scalar-vector GPU architecture.  ...  To better serve those operations, modern GPUs are armed with scalar units.  ...  The authors would also like to thank the GPGPU-Sim and Ocelot teams for use of their toolsets.  ... 
doi:10.1109/ispass.2013.6557173 dblp:conf/ispass/ChenKR13 fatcat:ttwlt5v5rzaajavqilbkysfnqm

Scalable parallel programming with CUDA

John Nickolls, Ian Buck, Michael Garland, Kevin Skadron
2008 ACM SIGGRAPH 2008 classes on - SIGGRAPH '08  
Brook for GPUs differentiates between FIFO input/output streams and random-access gather streams, and it supports parallel reductions.  ...  CUDA is supported on NVIDIA GPUs with the Tesla unified graphics and computing architecture of the GeForce 8-series, recent Quadro, Tesla, and future GPUs.  ...  On a serial processor, we would write a simple loop with a single accumulator variable to construct the sum of all elements in sequence.  ... 
doi:10.1145/1401132.1401152 dblp:conf/siggraph/NickollsBGS08 fatcat:tpkredxv3bgzva5unz23z7okz4


Anita Tino, Caroline Collange, André Seznec
2020 ACM Transactions on Architecture and Code Optimization (TACO)  
to 4T SMT  ...  Instructions fetched previously on the wrong path eventually commit with the null mask, but do not affect the architectural state of SIMT-X.  ... 
doi:10.1145/3392032 fatcat:miw637fpyfagvaulfnhoekjwtq

Scalable parallel programming

John Nickolls, Ian Buck, Michael Garland
2008 2008 IEEE Hot Chips 20 Symposium (HCS)  
CUDA is supported on NVIDIA GPUs with the Tesla unified graphics and computing architecture of the GeForce 8-series, recent Quadro, Tesla, and future GPUs.  ...  Brook for GPUs differentiates between FIFO input/output streams and random-access gather streams, and it supports parallel reductions.  ...  JOHN NICKOLLS is director of architecture at NVIDIA for GPU computing.  ... 
doi:10.1109/hotchips.2008.7476525 fatcat:ue6r5stwf5cqdb2x7ludiircuu

A control-structure splitting optimization for GPGPU

Snaider Carrillo, Jakob Siegel, Xiaoming Li
2009 Proceedings of the 6th ACM conference on Computing frontiers - CF '09  
Our techniques smartly increase code redundancy, which might be deemed as "de-optimization" for CPU, to improve the occupancy of a program on GPU and therefore improve performance.  ...  Control statements in a GPU program such as loops and branches pose serious challenges for the efficient usage of GPU resources because those control statements will lead to the serialization of threads  ...  The SIMT architecture can be ineffective for algorithms that require diverging control flow decisions, such as those generated from if-else statements, because the concurrency among threads will be reduced  ... 
doi:10.1145/1531743.1531766 dblp:conf/cf/CarrilloSL09 fatcat:rjc3rja5efbahf6st5bf36sxiu

Towards parallel and distributed computing on GPU for American basket option pricing

Michael Benguigui, Francoise Baude
2012 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings  
Some optimizations are exposed to get good performance of our parallel algorithm on GPU. In order to benefit from different GPU devices, a dynamic strategy of kernel calibration is proposed.  ...  Future work is geared towards the use of distributed computing infrastructures such as Grids and Clouds, equipped with GPUs, in order to benefit for even more parallelism in solving such computing intensive  ...  In [15] are introduced software optimizations. One of these targets divergent if-then-else branches in loops: at every iteration it groups same execution paths in a warp, delaying the others.  ... 
doi:10.1109/cloudcom.2012.6427593 dblp:conf/cloudcom/BenguiguiB12 fatcat:b3z4bskpfndsfnhuskdrecqnya

Fusion of Calling Sites

Douglas do Couto Teixeira, Sylvain Collange, Fernando Magno Quintao Pereira
2015 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)  
Divergences may impose a heavy burden on the performance of parallel programs. In this paper we propose a compiler-level optimization to mitigate this performance loss.  ...  One of these problems is the reconvergence of divergent threads. A divergence happens at a conditional branch when different threads disagree on the path to follow upon reaching this split point.  ...  In SIMT architectures, only paths that are executed by at least one thread are visited.  ... 
doi:10.1109/sbac-pad.2015.16 dblp:conf/sbac-pad/TeixeiraCP15 fatcat:bdzwo2dthje6xhb6lcbqjgjsga

DARM: Control-Flow Melding for SIMT Thread Divergence Reduction – Extended Version [article]

Charitha Saumya, Kirshanthan Sundararajah, Milind Kulkarni
2022 arXiv   pre-print
We observe that certain GPGPU kernels with control-flow divergence have similar control-flow structures with similar instructions on both sides of a branch.  ...  The control-flow divergence causes performance degradation because both paths of the branch must be executed one after the other.  ...  We would like to thank Tim Rogers for his feedback during discussions of this work and also providing us AMD GPUs for the experiments.  ... 
arXiv:2107.05681v3 fatcat:yda2r426rfdqdd2csn7w47usby

Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework

Asim Munawar, Mohamed Wahib, Masaharu Munetomo, Kiyoshi Akama
2009 Genetic Programming and Evolvable Machines  
GAs in their simple form are not suitable for implementation over the Single Instruction Multiple Thread (SIMT) architecture of a GPU, same is the case with conventional LS algorithms.  ...  We also discuss the effects of different optimization techniques on the overall execution time.  ...  Acknowledgements We would like to thank Dalila Boughachi for her help, regarding the use of genetic algorithms to solve MAX-SAT problem.  ... 
doi:10.1007/s10710-009-9091-4 fatcat:oxxjtqnderdnhamtl4huueioxa

NVIDIA Tesla: A Unified Graphics and Computing Architecture

Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym
2008 IEEE Micro  
Acknowledgments We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.  ...  It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points. Instruction set architecture.  ...  SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence. SIMT warp scheduling.  ... 
doi:10.1109/mm.2008.31 fatcat:dfatzl4dwzcjvg7e5ozkbygrli

Software-based branch predication for AMD GPUs

Ryan Taylor, Xiaoming Li
2011 SIGARCH Computer Architecture News  
of instruction on the GPU with little to no overhead.  ...  Due to the SIMD nature and massive multi-threading architecture of the GPU, branching can be costly if more than one path is taken by a set of concurrent threads in a kernel.  ...  The code was then optimized for parallelism and for the AMD GPU architecture and StreamSDK.  ... 
doi:10.1145/1926367.1926379 fatcat:pzzhdhig2zaxtprs7ldaepx7qi

Warp-aware trace scheduling for GPUs

James A. Jablin, Thomas B. Jablin, Onur Mutlu, Maurice Herlihy
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
Here, we propose "Warp-Aware Trace Scheduling" for GPUs.  ...  GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP).  ...  While this subject is well-understood for CPU architectures, it has received little attention for GPUs. Modern GPU architectures [30] have neglected ILP for two major reasons.  ... 
doi:10.1145/2628071.2628101 dblp:conf/IEEEpact/JablinJMH14 fatcat:crjqndrorjhiddj5c2hmvmlbom

GPU ray tracing

Steven G. Parker, Greg Humphreys, Morgan McGuire, Martin Stich, Heiko Friedrich, David Luebke, Keith Morley, James Bigler, Jared Hoberock, David McAllister, Austin Robison, Andreas Dietrich
2013 Communications of the ACM  
The NVIDIA ® OptiX ™ ray tracing engine is a programmable system designed for NVIDIA GPUs and other highly parallel architectures.  ...  For ease of use it exposes a single-ray programming model with full support for recursion and a dynamic dispatch mechanism similar to virtual function calls.  ...  At first blush, this is a challenge for GPUs that rely on SIMT execution for efficiency.  ... 
doi:10.1145/2447976.2447997 fatcat:fljznpvxmfbsfbqly2slgpcxdm