A Case for a Flexible Scalar Unit in SIMT Architecture
2014
2014 IEEE 28th International Parallel and Distributed Processing Symposium
The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing. ...
To overcome this inefficiency, AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. ...
ACKNOWLEDGEMENTS We thank the anonymous reviewers for their insightful comments to improve our paper. This work is supported by NSF grant CCF-1216569 and an NSF CAREER award, CCF-0968667. ...
doi:10.1109/ipdps.2014.21
dblp:conf/ipps/YangXMRHDZ14
fatcat:7wpm74uwzbfmhew23viaghu2qi
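A note on the opportunity this paper targets, with a minimal CUDA sketch (a hypothetical kernel, not taken from the paper): expressions that depend only on block-level values are recomputed identically by every thread of a warp, and a scalar unit like GCN's can issue such work once per wavefront.

// Minimal sketch of a "scalar opportunity": the block-uniform
// subexpressions below are evaluated redundantly by every lane on a
// plain SIMT pipeline, but could issue once on a scalar unit.
__global__ void saxpy_rows(const float* x, float* y, float a, int row_len) {
    int row_base = blockIdx.x * row_len;  // uniform across the block: scalar candidate
    float scale  = a * 0.5f;              // uniform across the grid: scalar candidate

    int col = threadIdx.x;                // varies per thread: stays on the vector lanes
    if (col < row_len)
        y[row_base + col] = scale * x[row_base + col] + y[row_base + col];
}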
Convergence and scalarization for data-parallel architectures
2013
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess ...
Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. ...
Marathe for valuable discussions on the convergence analysis algorithm. ...
doi:10.1109/cgo.2013.6494995
dblp:conf/cgo/LeeKGKA13
fatcat:kdtdw6xtzbejpfwuxkdsu7vrim
Characterizing scalar opportunities in GPGPU applications
2013
2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
We then evaluate the impact of scalar units on a heterogeneous scalar-vector GPU architecture. ...
To better serve those operations, modern GPUs are armed with scalar units. ...
The authors would also like to thank the GPGPU-Sim and Ocelot teams for use of their toolsets. ...
doi:10.1109/ispass.2013.6557173
dblp:conf/ispass/ChenKR13
fatcat:ttwlt5v5rzaajavqilbkysfnqm
Scalable parallel programming with CUDA
2008
ACM SIGGRAPH 2008 classes on - SIGGRAPH '08
Brook for GPUs differentiates between FIFO input/output streams and random-access gather streams, and it supports parallel reductions. ...
CUDA is supported on NVIDIA GPUs with the Tesla unified graphics and computing architecture of the GeForce 8-series, recent Quadro, Tesla, and future GPUs. ...
On a serial processor, we would write a simple loop with a single accumulator variable to construct the sum of all elements in sequence. ...
doi:10.1145/1401132.1401152
dblp:conf/siggraph/NickollsBGS08
fatcat:tpkredxv3bgzva5unz23z7okz4
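The serial-accumulator loop mentioned in the snippet contrasts with CUDA's parallel formulation; a standard shared-memory tree reduction is sketched below (an illustration in the spirit of the article, not its exact code).

// Serial form: one accumulator, one pass over the data.
float sum_serial(const float* x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += x[i];
    return acc;
}

// Parallel form: each block loads a tile into shared memory and reduces
// it as a tree, halving the number of active threads at every step.
// Assumes blockDim.x is a power of two.
__global__ void sum_blocks(const float* x, float* partial, int n) {
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];    // one partial sum per block
}

Launched as sum_blocks<<<blocks, threads, threads * sizeof(float)>>>(x, partial, n); the per-block partials are then reduced by a second pass or on the host.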
SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores
2020
ACM Transactions on Architecture and Code Optimization
Instructions fetched previously on the wrong path eventually commit with the null mask, but do not affect the architectural state of SIMT-X. ...
doi:10.1145/3392032
fatcat:miw637fpyfagvaulfnhoekjwtq
Scalable parallel programming
2008
2008 IEEE Hot Chips 20 Symposium (HCS)
CUDA is supported on NVIDIA GPUs with the Tesla unified graphics and computing architecture of the GeForce 8-series, recent Quadro, Tesla, and future GPUs. ...
Brook for GPUs differentiates between FIFO input/output streams and random-access gather streams, and it supports parallel reductions. ...
John Nickolls is director of architecture at NVIDIA for GPU computing. ...
doi:10.1109/hotchips.2008.7476525
fatcat:ue6r5stwf5cqdb2x7ludiircuu
A control-structure splitting optimization for GPGPU
2009
Proceedings of the 6th ACM conference on Computing frontiers - CF '09
Our techniques judiciously increase code redundancy, which might be deemed a "de-optimization" for a CPU, to improve a program's occupancy on the GPU and therefore its performance. ...
Control statements in a GPU program such as loops and branches pose serious challenges for the efficient usage of GPU resources because those control statements will lead to the serialization of threads ...
The SIMT architecture can be ineffective for algorithms that require diverging control flow decisions, such as those generated from if-else statements, because the concurrency among threads will be reduced ...
doi:10.1145/1531743.1531766
dblp:conf/cf/CarrilloSL09
fatcat:rjc3rja5efbahf6st5bf36sxiu
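The serialization this abstract describes is visible in a few lines of CUDA (a hypothetical kernel; the paper's splitting transformation itself operates on source-level loops and branches).

__device__ float path_even(int k) {
    float v = (float)k;
    for (int t = 0; t < 64; ++t) v = v * 1.0001f + 1.0f;  // stand-in for real work
    return v;
}
__device__ float path_odd(int k) {
    float v = (float)k;
    for (int t = 0; t < 64; ++t) v = v * 0.9999f - 1.0f;
    return v;
}

// Lanes of one warp disagree on the branch, so the hardware runs the
// 'then' lanes and the 'else' lanes one after the other: the branch
// costs roughly the sum of both paths.
__global__ void divergent(const int* key, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (key[i] % 2 == 0) out[i] = path_even(key[i]);
    else                 out[i] = path_odd(key[i]);
}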
Towards parallel and distributed computing on GPU for American basket option pricing
2012
4th IEEE International Conference on Cloud Computing Technology and Science Proceedings
Several optimizations are applied to obtain good performance of our parallel algorithm on the GPU. To benefit from different GPU devices, a dynamic kernel-calibration strategy is proposed. ...
Future work is geared towards the use of distributed computing infrastructures such as Grids and Clouds, equipped with GPUs, in order to benefit from even more parallelism in solving such computing-intensive ...
Software optimizations are introduced in [15]. One of these targets divergent if-then-else branches in loops: at every iteration it groups identical execution paths within a warp, delaying the others. ...
doi:10.1109/cloudcom.2012.6427593
dblp:conf/cloudcom/BenguiguiB12
fatcat:b3z4bskpfndsfnhuskdrecqnya
Fusion of Calling Sites
2015
2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Divergences may impose a heavy burden on the performance of parallel programs. In this paper we propose a compiler-level optimization to mitigate this performance loss. ...
One of these problems is the reconvergence of divergent threads. A divergence happens at a conditional branch when different threads disagree on the path to follow upon reaching this split point. ...
In SIMT architectures, only paths that are executed by at least one thread are visited. ...
doi:10.1109/sbac-pad.2015.16
dblp:conf/sbac-pad/TeixeiraCP15
fatcat:bdzwo2dthje6xhb6lcbqjgjsga
DARM: Control-Flow Melding for SIMT Thread Divergence Reduction – Extended Version
[article]
2022
arXiv
pre-print
We observe that certain GPGPU kernels with control-flow divergence have similar control-flow structures with similar instructions on both sides of a branch. ...
The control-flow divergence causes performance degradation because both paths of the branch must be executed one after the other. ...
We would like to thank Tim Rogers for his feedback during discussions of this work and also for providing us with AMD GPUs for the experiments. ...
arXiv:2107.05681v3
fatcat:yda2r426rfdqdd2csn7w47usby
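A hand-worked illustration of the melding idea (an assumed example, not DARM's actual compiler output): when both sides of a branch run structurally identical instructions, the shared sequence can execute once, with only the differing operand selected by the condition.

// Before: a diverged warp executes the same multiply-add twice,
// once for each side of the branch.
__device__ float before_meld(float a, float b, float d, float c, bool cond) {
    float x;
    if (cond) x = a * b + c;
    else      x = a * d + c;
    return x;
}

// After melding: a cheap select replaces the divergent region, and the
// multiply-add is executed once by the full warp.
__device__ float after_meld(float a, float b, float d, float c, bool cond) {
    float t = cond ? b : d;
    return a * t + c;
}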
Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework
2009
Genetic Programming and Evolvable Machines
GAs in their simple form are not suitable for implementation on the Single Instruction Multiple Thread (SIMT) architecture of a GPU; the same is true of conventional LS algorithms. ...
We also discuss the effects of different optimization techniques on the overall execution time. ...
Acknowledgements We would like to thank Dalila Boughachi for her help regarding the use of genetic algorithms to solve the MAX-SAT problem. ...
doi:10.1007/s10710-009-9091-4
fatcat:oxxjtqnderdnhamtl4huueioxa
NVIDIA Tesla: A Unified Graphics and Computing Architecture
2008
IEEE Micro
Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym (NVIDIA)
Acknowledgments We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market. ...
It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points. ...
SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence. ...
doi:10.1109/mm.2008.31
fatcat:dfatzl4dwzcjvg7e5ozkbygrli
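The SIMD-versus-SIMT point in the snippet comes down to code like the following (illustrative kernel): each thread issues an ordinary scalar load, and the memory system coalesces a warp's 32 lane addresses into wide transactions when they are contiguous, work that a SIMD ISA pushes onto the programmer.

__global__ void scale(const float* in, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];  // a warp reads in[i..i+31]: coalesced by hardware

    // A strided pattern such as in[i * 33] would defeat coalescing and is
    // the kind of access a SIMD architecture makes software repack by hand.
}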
Software-based branch predication for AMD GPUs
2011
SIGARCH Computer Architecture News
of instruction on the GPU with little to no overhead. ...
Due to the SIMD nature and massive multi-threading architecture of the GPU, branching can be costly if more than one path is taken by a set of concurrent threads in a kernel. ...
The code was then optimized for parallelism and for the AMD GPU architecture and StreamSDK. ...
doi:10.1145/1926367.1926379
fatcat:pzzhdhig2zaxtprs7ldaepx7qi
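The costly branching the abstract refers to, and the predicated alternative, in minimal CUDA form (an analogue of the idea; the paper itself targets AMD's ISA through StreamSDK).

// Divergent form: lanes that disagree on the comparison serialize.
__global__ void branched(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f) y[i] = x[i] * 2.0f;
    else             y[i] = x[i] * -1.0f;
}

// Predicated form: every lane computes both candidate values and the
// condition selects one, so control flow never splits the warp.
__global__ void predicated(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    y[i] = (v > 0.0f) ? v * 2.0f : -v;  // compiles to a select/predicated op
}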
Warp-aware trace scheduling for GPUs
2014
Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
Here, we propose "Warp-Aware Trace Scheduling" for GPUs. ...
GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). ...
While this subject is well-understood for CPU architectures, it has received little attention for GPUs. Modern GPU architectures [30] have neglected ILP for two major reasons. ...
doi:10.1145/2628071.2628101
dblp:conf/IEEEpact/JablinJMH14
fatcat:crjqndrorjhiddj5c2hmvmlbom
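One standard way to expose the ILP this abstract distinguishes from TLP is to give each thread several independent operations, as in this sketch (hypothetical kernel; tail handling omitted for brevity).

// Four independent accumulator chains per thread let the scheduler
// overlap the loads and multiplies instead of stalling on one chain.
__global__ void dot4(const float* a, const float* b, float* out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int i = 4 * t;
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    if (i + 3 < n) {
        s0 = a[i]     * b[i];      // the four products carry no
        s1 = a[i + 1] * b[i + 1];  // dependences on one another,
        s2 = a[i + 2] * b[i + 2];  // so they can issue back to back
        s3 = a[i + 3] * b[i + 3];
    }
    out[t] = (s0 + s1) + (s2 + s3);  // pairwise sum keeps two chains live
}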
GPU ray tracing
2013
Communications of the ACM
The NVIDIA ® OptiX ™ ray tracing engine is a programmable system designed for NVIDIA GPUs and other highly parallel architectures. ...
For ease of use it exposes a single-ray programming model with full support for recursion and a dynamic dispatch mechanism similar to virtual function calls. ...
At first blush, this is a challenge for GPUs that rely on SIMT execution for efficiency. ...
doi:10.1145/2447976.2447997
fatcat:fljznpvxmfbsfbqly2slgpcxdm