Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
2013
Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13
This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges. ...
We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment. ...
ACKNOWLEDGEMENTS We thank the developers of GPGPU-Sim and GPUOcelot. We also thank the anonymous reviewers, who provided excellent feedback for preparing the final version of this paper.
REFERENCES ...
doi:10.1145/2485922.2485953
dblp:conf/isca/RhuE13
fatcat:qtfc5bgxlzggjeymbqp5xao3f4
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
2013
SIGARCH Computer Architecture News
This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges. ...
We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment. ...
ACKNOWLEDGEMENTS We thank the developers of GPGPU-Sim and GPUOcelot. We also thank the anonymous reviewers, who provided excellent feedback for preparing the final version of this paper.
REFERENCES ...
doi:10.1145/2508148.2485953
fatcat:kcqz3xs5frg4bpndomig27gsfq
A Multi-instruction Streams Extension Mechanism for SIMD Processor
2017
Chinese journal of electronics
SIMD resource utilization in GPGPUs with SIMD lane permutation”, Interna- ...
utilization with thread-lane shuffled compaction in GPGPU”, Chinese Journal of Electronics, Vol.24, No.2, pp.684–688, 2015.
[4] W.W.L. Fung, I. Sham and G. ...
doi:10.1049/cje.2017.09.013
fatcat:ctvv3otp2ndstaqamer65pvydu
Low-Power, Real-Time Object-Recognition Processors for Mobile Vision Systems
2012
IEEE Micro
To support two ROI tiles per core and eight ROI tiles in four SFECs maximally, the SMT is integrated for the SIMD core with increased system utilization. ...
In BONE-V5, the IPC and utilization of the 16-lane SIMD processing element are increased by utilizing not only the fine-grained pipeline but also the SMT operation based on the pipelined SIMD data path ...
doi:10.1109/mm.2012.90
fatcat:e6st56neobcy3fho2fipizqzkq
Horton Tables: Fast Hash Tables for In-Memory Data-Intensive Computing
2016
USENIX Annual Technical Conference
With these advancements, Horton tables outperform BCHTs by 17% to 89%. ...
Positive lookups (key is in the table) and negative lookups (where it is not) on average access 1.5 and 2.0 buckets, respectively, which results in 50 to 100% more table-containing cache lines to be accessed ...
For a primer on SIMD and GPGPU architectures, we recommend these excellent references: H&P (Ch. 4) [30] and Keckler et al. [39] . ...
dblp:conf/usenix/BreslowZGJT16
fatcat:b4cq5mcyjrhgpgzrivhy7acbo4
Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
2011
IEEE Transactions on Parallel and Distributed Systems
In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies ...
Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. ...
in each GPGPU programming model. ...
doi:10.1109/tpds.2010.107
fatcat:lync4a5tlvf37g3w5kuomqzxje
Adapting Particle Filter Algorithms to Many-Core Architectures
2013
2013 IEEE 27th International Symposium on Parallel and Distributed Processing
It is ideal for non-linear, non-Gaussian dynamical systems with applications in many areas, such as computer vision, robotics, and econometrics. ...
In this study, we investigate how to design a particle filter framework for complex estimation problems using many-core architectures. ...
For our robotic arm application with nine state variables utilizing over one million particles, we pushed our GPGPUs to attain estimation rates of 100-200 Hz. ...
doi:10.1109/ipdps.2013.88
dblp:conf/ipps/ChitchianASKS13
fatcat:lvjezokuxbdq5ndtqlokhoarze
Data-Parallel Hashing Techniques for GPU Architectures
[article]
2018
arXiv
pre-print
Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. ...
In SIMT, scalar instructions control individual threads, whereas in SIMD, vector instructions control the entire set of data lanes. ...
SIMT execution is similar to SIMD, but differs in that SIMT applies one instruction to multiple independent warp threads in parallel, instead of to multiple data lanes. ...
arXiv:1807.04345v1
fatcat:hjqikv3wjfgahavy3lbweolhh4
Monte Carlo methods for massively parallel computers
[article]
2017
arXiv
pre-print
Here we outline the opportunities and challenges of massively parallel computing for Monte Carlo simulations in statistical physics, with a focus on the simulation of systems exhibiting phase transitions ...
Applications that require substantial computational resources today cannot avoid the use of heavily parallel machines. ...
It corresponds to a tiled SIMD architecture, where each multiprocessor has SIMD semantics, but the vector lanes are promoted to fibers with the possibility of divergent control flow through the masking ...
arXiv:1709.04394v1
fatcat:l25zmj2rcfej3ew4xh6mv5ul3u
RT-CUDA: A Software Tool for CUDA Code Restructuring
2016
International journal of parallel programming
For example, to exchange data between 4 groups of 8 lanes in a SIMD manner. ...
So, in order to efficiently utilize the GPU resources, implementations should be done with detailed understanding of the underlying architecture and CUDA kernel optimizations that is very tedious even ...
doi:10.1007/s10766-016-0433-6
fatcat:xxikpkyrkvdizgijskmk2qxkay
Polychroniou_columbia_0054D_14504.pdf
[article]
2018
In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance ...
We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features ...
In Section 5.9, we discuss how SIMD vectorization relates to SIMT in GPGPUs and we conclude in Section 5.10. ...
doi:10.7916/d8k94qmp
fatcat:glh57zlatnfo3e6lbarghk2d2i
GPU power modeling and architectural enhancements for GPU energy efficiency
[article]
2019
This reduces the required control logic but also results in lower performance in applications with irregular control flow. ...
We continue with enhancements to improve the energy efficiency of the GPU cores. ...
If all threads in a warp follow the same control flow, the full throughput of the SIMD execution units can be utilized. ...
doi:10.14279/depositonce-7874
fatcat:wbmij23r2ngtfaskosnrsxt5gu
A hybrid architecture for bioinformatics
2002
Future generations computer systems
sockets contain devices with maximum lane count. ...
One of both calls will return with a "Device or resource busy" error code, as defined in the Linux standard libraries as EBUSY. ...
doi:10.1016/s0167-739x(02)00058-4
fatcat:ktrtocmzh5hkzgdprusjxddmbi
High-performance computing systems: Status and outlook
2012
Acta Numerica
We review the different ways devised to speed them up, both with regard to components and their architecture. ...
In addition, we discuss the requirements for software that can take advantage of existing and future architectures. ...
Scheduling based on a directed acyclic graph also requires new approaches to optimizing for resource utilization without compromising spatial locality. ...
doi:10.1017/s0962492912000050
fatcat:n6yodkox5zb6xmlep6gvayud2m
Irregularity mitigation and portability abstractions for accelerated sparse matrix factorization
2021
In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels which are portable across a diverse ...
At the same time, we are witnessing a shift in terms of hardware in the high-performance computing field: as hardware designers try to avoid the quadratically increasing energy consumption for higher clock ...
While CPUs with SIMD registers are designed to minimize latency of sequential computations, GPUs originate in 3D rendering and thus are constructed with throughput maximization in mind. ...
doi:10.26083/tuprints-00017951
fatcat:jhttfyetqnbmvigpltkkt624km