Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Minsoo Rhu, Mattan Erez
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges.  ...  We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment.  ...  ACKNOWLEDGEMENTS We thank the developers of GPGPU-Sim and GPUOcelot. We also thank the anonymous reviewers, who provided excellent feedback for preparing the final version of this paper. REFERENCES  ... 
doi:10.1145/2485922.2485953 dblp:conf/isca/RhuE13 fatcat:qtfc5bgxlzggjeymbqp5xao3f4

Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Minsoo Rhu, Mattan Erez
2013 SIGARCH Computer Architecture News  
This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges.  ...  We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment.  ...  ACKNOWLEDGEMENTS We thank the developers of GPGPU-Sim and GPUOcelot. We also thank the anonymous reviewers, who provided excellent feedback for preparing the final version of this paper. REFERENCES  ... 
doi:10.1145/2508148.2485953 fatcat:kcqz3xs5frg4bpndomig27gsfq

A Multi-instruction Streams Extension Mechanism for SIMD Processor

Yuanxi Peng, Feng Zhou, Yue Hai, Yaohua Wang
2017 Chinese Journal of Electronics
SIMD resource utilization in GPGPUs with SIMD lane permutation”, Interna-  ...  utilization with thread-lane shuffled compaction in GPGPU”, Chinese Journal of Electronics, Vol.24, No.2, pp.684–688, 2015. [4] W.W.L. Fung, I. Sham and G.  ... 
doi:10.1049/cje.2017.09.013 fatcat:ctvv3otp2ndstaqamer65pvydu

Low-Power, Real-Time Object-Recognition Processors for Mobile Vision Systems

Jinwook Oh, Gyeonghoon Kim, Injoon Hong, Junyoung Park, Seungjin Lee, Joo-Young Kim, Jeong-Ho Woo, Hoi-Jun Yoo
2012 IEEE Micro  
To support two ROI tiles per core and eight ROI tiles in four SFECs maximally, the SMT is integrated for the SIMD core with increased system utilization.  ...  In BONE-V5, the IPC and utilization of the 16-lane SIMD processing element are increased by utilizing not only the fine-grained pipeline but also the SMT operation based on the pipelined SIMD data path  ... 
doi:10.1109/mm.2012.90 fatcat:e6st56neobcy3fho2fipizqzkq

Horton Tables: Fast Hash Tables for In-Memory Data-Intensive Computing

Alex D. Breslow, Dong Ping Zhang, Joseph L. Greathouse, Nuwan Jayasena, Dean M. Tullsen
2016 USENIX Annual Technical Conference  
With these advancements, Horton tables outperform BCHTs by 17% to 89%.  ...  Positive lookups (key is in the table) and negative lookups (where it is not) on average access 1.5 and 2.0 buckets, respectively, which results in 50 to 100% more table-containing cache lines to be accessed  ...  For a primer on SIMD and GPGPU architectures, we recommend these excellent references: H&P (Ch. 4) [30] and Keckler et al. [39].  ... 
dblp:conf/usenix/BreslowZGJT16 fatcat:b4cq5mcyjrhgpgzrivhy7acbo4

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Byunghyun Jang, Dana Schaa, Perhaad Mistry, David Kaeli
2011 IEEE Transactions on Parallel and Distributed Systems  
In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies  ...  Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy.  ...  in each GPGPU programming model.  ... 
doi:10.1109/tpds.2010.107 fatcat:lync4a5tlvf37g3w5kuomqzxje

Adapting Particle Filter Algorithms to Many-Core Architectures

Mehdi Chitchian, Alexander S. van Amesfoort, Andrea Simonetto, Tamas Keviczky, Henk J. Sips
2013 2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
It is ideal for non-linear, non-Gaussian dynamical systems with applications in many areas, such as computer vision, robotics, and econometrics.  ...  In this study, we investigate how to design a particle filter framework for complex estimation problems using many-core architectures.  ...  For our robotic arm application with nine state variables utilizing over one million particles, we pushed our GPGPUs to attain estimation rates of 100-200 Hz.  ... 
doi:10.1109/ipdps.2013.88 dblp:conf/ipps/ChitchianASKS13 fatcat:lvjezokuxbdq5ndtqlokhoarze

Data-Parallel Hashing Techniques for GPU Architectures [article]

Brenton Lessley
2018 arXiv   pre-print
Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning.  ...  In SIMT, scalar instructions control individual threads, whereas in SIMD, vector instructions control the entire set of data lanes.  ...  SIMT execution is similar to SIMD, but differs in that SIMT applies one instruction to multiple independent warp threads in parallel, instead of to multiple data lanes.  ... 
arXiv:1807.04345v1 fatcat:hjqikv3wjfgahavy3lbweolhh4

Monte Carlo methods for massively parallel computers [article]

Martin Weigel
2017 arXiv   pre-print
Here we outline the opportunities and challenges of massively parallel computing for Monte Carlo simulations in statistical physics, with a focus on the simulation of systems exhibiting phase transitions  ...  Applications that require substantial computational resources today cannot avoid the use of heavily parallel machines.  ...  It corresponds to a tiled SIMD architecture, where each multiprocessor has SIMD semantics, but the vector lanes are promoted to fibers with the possibility of divergent control flow through the masking  ... 
arXiv:1709.04394v1 fatcat:l25zmj2rcfej3ew4xh6mv5ul3u

RT-CUDA: A Software Tool for CUDA Code Restructuring

Ayaz H. Khan, Mayez Al-Mouhamed, Muhammed Al-Mulhem, Adel F. Ahmed
2016 International Journal of Parallel Programming
For example, to exchange data between 4 groups of 8 lanes in a SIMD manner.  ...  So, in order to efficiently utilize the GPU resources, implementations should be done with detailed understanding of the underlying architecture and CUDA kernel optimizations that is very tedious even  ... 
doi:10.1007/s10766-016-0433-6 fatcat:xxikpkyrkvdizgijskmk2qxkay

Polychroniou_columbia_0054D_14504.pdf [article]

2018
In the era of hardware becoming increasingly parallel and datasets consistently growing in size, this thesis can serve as a compass for developing hardware-conscious databases with truly high-performance  ...  We evaluate our algorithms and techniques on both mainstream hardware and on many-integrated-core platforms, and combine our techniques in a new query engine design that can better utilize the features  ...  In Section 5.9, we discuss how SIMD vectorization relates to SIMT in GPGPUs and we conclude in Section 5.10.  ... 
doi:10.7916/d8k94qmp fatcat:glh57zlatnfo3e6lbarghk2d2i

GPU power modeling and architectural enhancements for GPU energy efficiency [article]

Jan Lucas, Technische Universität Berlin, Ben Juurlink
2019
This reduces the required control logic but also results in lower performance in applications with irregular control flow.  ...  We continue with enhancements to improve the energy efficiency of the GPU cores.  ...  If all threads in a warp follow the same control flow, the full throughput of the SIMD execution units can be utilized.  ... 
doi:10.14279/depositonce-7874 fatcat:wbmij23r2ngtfaskosnrsxt5gu

A hybrid architecture for bioinformatics

Bertil Schmidt, Heiko Schröder, Manfred Schimmler
2002 Future generations computer systems  
sockets contain devices with maximum lane count.  ...  One of both calls will return with a "Device or resource busy" error code, as defined in the Linux standard libraries as EBUSY.  ... 
doi:10.1016/s0167-739x(02)00058-4 fatcat:ktrtocmzh5hkzgdprusjxddmbi

High-performance computing systems: Status and outlook

J. J. Dongarra, A. J. van der Steen
2012 Acta Numerica  
We review the different ways devised to speed them up, both with regard to components and their architecture.  ...  In addition, we discuss the requirements for software that can take advantage of existing and future architectures.  ...  Scheduling based on a directed acyclic graph also requires new approaches to optimizing for resource utilization without compromising spatial locality.  ... 
doi:10.1017/s0962492912000050 fatcat:n6yodkox5zb6xmlep6gvayud2m

Irregularity mitigation and portability abstractions for accelerated sparse matrix factorization

Daniel Thürck
2021
In this thesis, we investigate new ways to mitigate the inherent irregularity in sparse matrix factorizations and decompose the resulting computation into simple kernels which are portable across a diverse  ...  At the same time, we are witnessing a shift in terms of hardware in the high-performance computing field: as hardware designers try to avoid the quadratically increasing energy consumption for higher clock  ...  While CPUs with SIMD registers are designed to minimize latency of sequential computations, GPUs originate in 3D rendering and thus are constructed with throughput maximization in mind.  ... 
doi:10.26083/tuprints-00017951 fatcat:jhttfyetqnbmvigpltkkt624km