693 Hits in 6.5 sec

Neither more nor less: optimizing thread-level parallelism for GPGPUs

Jose-Maria Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis
2013 Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques  
We thus propose a technique which we term Parallel Frame Rendering (PFR). Under PFR, we split the GPU into two clusters where two consecutive frames are rendered in parallel.  ...  Traditional designs improve the degree of multi-threading and the memory bandwidth, as a means of improving performance.  ...  Fragment Processors have private first level caches, but the second level cache is shared among all processors of both clusters, so we expect that one cluster prefetches data for the other cluster as discussed  ... 
doi:10.1109/pact.2013.6618806 dblp:conf/IEEEpact/ArnauPX13 fatcat:c6tyy5vi7bdi3eottnmtk5uily
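The frame-to-cluster split described in the snippet above can be sketched as a simple round-robin assignment; `NUM_CLUSTERS` and `cluster_for_frame` are illustrative assumptions, not the paper's implementation.

```python
# Sketch of Parallel Frame Rendering's frame assignment (assumption:
# round-robin over two GPU clusters, per the abstract snippet).
NUM_CLUSTERS = 2

def cluster_for_frame(frame_id: int) -> int:
    """Consecutive frames go to alternating GPU clusters."""
    return frame_id % NUM_CLUSTERS

# Frames 0 and 1 render in parallel on clusters 0 and 1.
print([cluster_for_frame(f) for f in range(4)])  # [0, 1, 0, 1]
```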

Parallel quadtree coding of large-scale raster geospatial data on GPGPUs

Jianting Zhang, Simin You, Le Gruenwald
2011 Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems - GIS '11  
such data parallelism on massively parallel General Purpose Graphics Processing Units (GPGPUs) that require fine-grained parallelization.  ...  While the inherent data parallelism of large-scale raster geospatial data allows straightforward coarse-grained parallelization at the chunk level on CPUs, it is largely unclear how to effectively exploit  ...  CUDA has two levels of parallelism: block level and thread level [8].  ... 
doi:10.1145/2093973.2094047 dblp:conf/gis/ZhangYG11 fatcat:zwr7obnjozcwvdmsbquydic5fm
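The two-level CUDA decomposition mentioned in the snippet above (a grid of blocks, each block a group of threads) is commonly flattened into a single global index; a minimal sketch, with block and grid sizes chosen purely for illustration:

```python
# Sketch of CUDA's two levels of parallelism: block level and thread level.
# The per-element work index is computed from both levels, mirroring the
# CUDA idiom blockIdx.x * blockDim.x + threadIdx.x.
BLOCK_DIM = 256  # threads per block (illustrative)

def global_thread_id(block_idx: int, thread_idx: int) -> int:
    """Flatten (block index, thread index) into one data-parallel index."""
    return block_idx * BLOCK_DIM + thread_idx

# Thread 3 of block 2 handles element 2 * 256 + 3 = 515.
print(global_thread_id(2, 3))  # 515
```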

GPGPUs: How to combine high computational power with high reliability

L. Bautista Gomez, F. Cappello, L. Carro, N. DeBardeleben, B. Fang, S. Gurumurthi, K. Pattabiraman, P. Rech, M. Sonza Reorda
2014 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014  
GPGPUs are increasingly used in several domains, from gaming to different kinds of computationally intensive applications.  ...  Secondly, it provides recent results about the reliability of some GPGPUs, derived from radiation experiments.  ...  [39] propose three approaches for GPGPU reliability that leverage both instruction-level parallelism and thread-level parallelism to replicate the application code.  ... 
doi:10.7873/date.2014.354 dblp:conf/date/Bautista-GomezCCDFGPRR14 fatcat:y476sawz3jgy5g4hox4iarldsy

Exploring memory consistency for massively-threaded throughput-oriented processors

Blake A. Hechtman, Daniel J. Sorin
2013 SIGARCH Computer Architecture News  
We re-visit the issue of hardware consistency models in the new context of massively-threaded throughput-oriented processors (MTTOPs).  ...  A prominent example of an MTTOP is a GPGPU, but other examples include Intel's MIC architecture and some recent academic designs.  ...  Such optimizations for per-thread MLP are likely to be less beneficial for MTTOPs.  ...  An MTTOP core supports dozens of threads.  ... 
doi:10.1145/2508148.2485940 fatcat:jrbhg7prcrgx5kajihgfd7snxi

Exploring memory consistency for massively-threaded throughput-oriented processors

Blake A. Hechtman, Daniel J. Sorin
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
We re-visit the issue of hardware consistency models in the new context of massively-threaded throughput-oriented processors (MTTOPs).  ...  A prominent example of an MTTOP is a GPGPU, but other examples include Intel's MIC architecture and some recent academic designs.  ...  Such optimizations for per-thread MLP are likely to be less beneficial for MTTOPs.  ...  An MTTOP core supports dozens of threads.  ... 
doi:10.1145/2485922.2485940 dblp:conf/isca/HechtmanS13 fatcat:tuakbaweffg4hmekzomyqi2imm

Design and implementation of a parallel priority queue on many-core architectures

Xi He, Dinesh Agarwal, Sushil K. Prasad
2012 2012 19th International Conference on High Performance Computing  
Compared to this, our optimized multicore parallelization of the parallel heap yields only 2-3 fold speedup for such fine-grained loads.  ...  This parallelization of a tree-based data structure on GPGPUs provides a roadmap for future parallelizations of other such data structures.  ...  As we can see, many of the priority queue applications are computation-intensive, but the parallelism is regular neither in space nor in time.  ... 
doi:10.1109/hipc.2012.6507490 dblp:conf/hipc/HeAP12 fatcat:l3yvxramknek7hzqgy6fhu2jba

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Yi Yang, Chao Li, Huiyang Zhou
2015 Journal of Computer Science and Technology  
Parallel programs consist of a series of code sections with different thread-level parallelism (TLP).  ...  Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69  ...  We also want to thank the ARC Cluster [16] for providing Nvidia K20c GPUs.  ... 
doi:10.1007/s11390-015-1500-y fatcat:42baxdr3hbf6tnxgo25ycfp2c4

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs

Peng Di, Qing Wan, Xuemeng Zhang, Hui Wu, Jingling Xue
2010 2010 39th International Conference on Parallel Processing  
We substantiate this finding with a case study by presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs.  ...  To exploit the full potential of GPGPUs for general-purpose computing, DOACR parallelism abundant in scientific and engineering applications must be harnessed.  ...  further on GPGPUs if they are algorithmically restructured to be more amenable to GPGPU parallelization, judiciously optimized, and carefully tuned by a performance-tuning tool.  ... 
doi:10.1109/icpp.2010.13 dblp:conf/icpp/DiWZWX10 fatcat:udvg2zldevb5lm6r3yafb2km5q

CUDA-NP

Yi Yang, Huiyang Zhou
2014 Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14  
Parallel programs consist of a series of code sections with different thread-level parallelism (TLP).  ...  Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69  ...  We also want to thank the ARC Cluster [16] for providing Nvidia K20c GPUs.  ... 
doi:10.1145/2555243.2555254 dblp:conf/ppopp/YangZ14 fatcat:ts2yttcyzndrliod625rkzjyty

CUDA-NP

Yi Yang, Huiyang Zhou
2014 SIGPLAN notices  
Parallel programs consist of a series of code sections with different thread-level parallelism (TLP).  ...  Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69  ...  We also want to thank the ARC Cluster [16] for providing Nvidia K20c GPUs.  ... 
doi:10.1145/2692916.2555254 fatcat:kxedfqo55fgrdoxghfjjbi27tu

Shadowfax

Alexander M. Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan
2011 Proceedings of the 5th international workshop on Virtualization technologies in distributed computing - VTDC '11  
CPUs and CUDA-supported GPGPUs to form a 'virtual execution platform' for an application.  ...  To address this problem and to support increased flexibility in usage models for CUDA-based GPGPU applications, our research proposes GPGPU assemblies, where each assembly combines a desired number of  ...  Reattaching to a remote GPGPU shows almost no gains as the CPU is still required for moving work to its destination. Our data shows that neither local nor remote vGPUs can remove this impediment.  ... 
doi:10.1145/1996121.1996124 dblp:conf/hpdc/MerrittGVGS11 fatcat:qjnqtxg5wjd6lnwicoplvg4e4y

The Case For Heterogeneous HTAP

Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, Anastasia Ailamaki
2017 Conference on Innovative Data Systems Research  
It is thus necessary to revisit database engine design because current engines can neither deal with the lack of cache coherence nor exploit heterogeneous parallelism.  ...  Second, as GPGPUs overcome programmability, performance, and interfacing limitations, they are being increasingly adopted by emerging servers to expose heterogeneous parallelism.  ...  We would like to thank the anonymous reviewers and the DIAS laboratory members for their constructive feedback.  ... 
dblp:conf/cidr/AppuswamyKPA17 fatcat:4dzw4rtkurazpgidsle3756mzy

Efficient Probabilistic Latent Semantic Indexing using Graphics Processing Unit

Eli Koffi Kouassi, Toshiyuki Amagasa, Hiroyuki Kitagawa
2011 Procedia Computer Science  
We compare the results to the most recent parallel execution of PLSI which combines a method of parallelization by OpenMP with the Message Passing Interface (MPI) for distributed memory parallelization  ...  The first method is to accelerate the Expectation-Maximization (EM) computation by applying GPGPU matrix-vector multiplication.  ...  Acknowledgement The authors gratefully acknowledge the funding support of Grant-in-Aid for Scientific Research on Priority Areas by MEXT (#21013004).  ... 
doi:10.1016/j.procs.2011.04.040 fatcat:jxavxqszonawtpqdsdp6xebfje
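The kernel the entry above offloads to the GPU to accelerate EM is a dense matrix-vector product; a sequential reference sketch (sizes and plain-list representation chosen only for illustration):

```python
# Dense matrix-vector multiplication, the operation the paper accelerates
# with GPGPU hardware (sketched here sequentially on the CPU).
def matvec(A, x):
    """Return A @ x for a row-major matrix A and vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# [1 2] [5]   [1*5 + 2*6]   [17]
# [3 4] [6] = [3*5 + 4*6] = [39]
print(matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```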

Scan Primitives for GPU Computing [article]

Shubhabrata Sengupta, Mark Harris, Yao Zhang, John D. Owens
2007 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - GH '07  
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications.  ...  matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for  ...  Acknowledgements Many thanks to Jim Ahrens, Guy Blelloch, Jeff Inman, and Pat McCormick for thoughtful discussions about our scan implementation and its applications.  ... 
doi:10.2312/eggh/eggh07/097-106 fatcat:zbhoiatqsfazzdizmjs7yrpuku
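The scan (prefix-sum) primitive the entry above builds on is easiest to see in its sequential form; a minimal exclusive-scan sketch as a reference for the data-parallel GPU versions the paper studies:

```python
# Exclusive prefix sum (scan): out[i] is the sum of all inputs before i.
# This sequential version defines the semantics of the data-parallel
# primitive; GPU implementations compute the same result in parallel.
def exclusive_scan(xs):
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

print(exclusive_scan([3, 1, 7, 0, 4]))  # [0, 3, 4, 11, 11]
```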

Optimising memory management for Belief Propagation in Junction Trees using GPGPUs

Filippo Bistaffa, Alessandro Farinelli, Nicola Bombieri
2014 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)  
reduced data transfers between the host and the GPGPU, and scalability.  ...  Such an approach has significant computational requirements that can be addressed by using highly parallel architectures (i.e., General Purpose Graphics Processing Units) to parallelise the message update  ...  On the other hand, (2e) contains neither d i nor d j , since it refers to all the variables after X i , thus (2e) is not affected either.  ... 
doi:10.1109/padsw.2014.7097850 dblp:conf/icpads/BistaffaFB14 fatcat:vxwwf3gcy5h4pfx2cwzfeb2rua
Showing results 1 — 15 out of 693 results