1,503 Hits in 3.0 sec

Parallel Prefix Sum with SIMD

Wangda Zhang, Yanbin Wang, Kenneth A. Ross
2020 International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures  
In this paper, we study different methods of computing prefix sums with SIMD instructions and multiple threads.  ...  With multithreading, the memory bandwidth can become the bottleneck of prefix sum computations.  ...  Listing 1: In-register prefix sums with Horizontal SIMD  ...  synchronization points are needed for this multi-iteration process.  ... 
dblp:conf/adms/ZhangWR20 fatcat:sdhg7j2a5zcrvnwipezmtzagiu

Stream compaction for deferred shading

Jared Hoberock, Victor Lu, Yuntao Jia, John C. Hart
2009 Proceedings of the 1st ACM conference on High Performance Graphics - HPG '09  
In all but simply shaded scenes, we show the expense of sorting shaders pays off with better overall shading performance.  ...  Figure 1: Efficient execution of multiple shaders poses a challenge for data parallel ray tracing and other deferred shading algorithms.  ...  Acknowledgments: This work was funded by the Universal Parallel Computing Research Center at the University of Illinois at Urbana-Champaign.  ... 
doi:10.1145/1572769.1572797 fatcat:qexkoons5jgfxllftxpy7s3xyq

Rank/Select Queries over Mutable Bitmaps [article]

Giulio Ermanno Pibiri, Shunsuke Kanda
2021 arXiv   pre-print
By adapting and properly extending some results concerning prefix-sum data structures, we present a practical solution to the problem, tailored for modern CPU instruction sets.  ...  Compared to the state-of-the-art, our solution improves runtime with no space degradation.  ...  the prefix-sums in parallel and finds the target 𝑗-th word using SIMD AVX-512 instructions.  ... 
arXiv:2009.12809v2 fatcat:uoyva5jqwbhsppobn3mmrgrfsi

Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions [article]

Philippos Papaphilippou, Paul H. J. Kelly, Wayne Luk
2021 arXiv   pre-print
In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache.  ...  Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded in future CPUs that feature reconfigurable regions  ...  A similar discussion can be made for prefix sum [48], though parallel prefix sum uses more comparisons than the serial case, hence the less notable speedups.  ... 
arXiv:2106.07456v1 fatcat:kovjluczwnavpnrnmohlwsftlq

Fast integer compression using SIMD instructions

Benjamin Schlegel, Rainer Gemulla, Wolfgang Lehner
2010 Proceedings of the Sixth International Workshop on Data Management on New Hardware - DaMoN '10  
More specifically, we provide SIMD versions of both null suppression and Elias gamma encoding.  ...  In contrast to traditional integer compression, our algorithms make use of the SIMD capabilities of modern processors by encoding multiple integer values at once.  ...  Today, our parallel versions are on par with uncompressed processing in many cases.  ... 
doi:10.1145/1869389.1869394 dblp:conf/damon/SchlegelGL10 fatcat:yiqxth2o5jg5zadcqckzeuvlcm

One Dimensional SIMD Array Processor with Segmentable Bus

Fa-cun Zhang, Wei Liu, Qian-kun Wang
2011 Procedia Engineering  
Additionally, segmentable bus provides high flexibility for different demands so that PEs can cooperate with each other more efficiently.  ...  By the analysis of the application requirement and the architectures of parallel computer, an embedded data parallel computer architecture model is proposed for multimedia processing applications.  ...  To save space, this paper only gives one example of a SIMD computing for the proposed model: a common algorithm that is the prefix sum for calculation of the histogram which is a typical point operation  ... 
doi:10.1016/j.proeng.2011.08.694 fatcat:afkex63hy5hsvmo3aoxgwfdk3u

Bitpacking techniques for indexing genomes: I. Hash tables

Thomas D. Wu
2016 Algorithms for Molecular Biology  
It also has potential applications to other domains requiring differential coding with random access.  ...  Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval.  ...  prefix sum for the next block.  ... 
doi:10.1186/s13015-016-0069-5 pmid:27095998 pmcid:PMC4835851 fatcat:yhuslfqwn5d63ksjoefqg7todq

SIMD compression and the intersection of sorted integers

Daniel Lemire, Leonid Boytsov, Nathan Kurz
2015 Software, Practice & Experience  
We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track.  ...  We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once.  ...  Thankfully we can accelerate the computation of the prefix sum using SIMD instructions. Our first contribution is to revisit the computation of the prefix sum.  ... 
doi:10.1002/spe.2326 fatcat:2wgknj4ysjeermg3pzitmrmtfi

Block aligner: fast and flexible pairwise sequence alignment with SIMD-accelerated adaptive blocks [article]

Daniel Liu, Martin Steinegger
2021 bioRxiv   pre-print
Since differences between cells are small, this allows for maximum parallelism with SIMD vectors.  ...  In this paper, we assume that the CPU supports SIMD vectors with L 16-bit lanes that can be operated on with some basic operation in parallel.  ...  Abbreviations: AVX: Advanced Vector Extensions; DP: Dynamic Programming; SIMD: Single Instruction Multiple Data. Competing interests: The authors declare that they have no competing interests.  ... 
doi:10.1101/2021.11.08.467651 fatcat:tslwfps625ggjiy6uchy3l5qi4

Policy-based tuning for performance portability and library co-optimization

Duane Merrill, Michael Garland, Andrew Grimshaw
2012 2012 Innovative Parallel Computing (InPar)  
In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor  ...  From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting  ...  We also apply the same thread-serialization techniques for constructing local implementations of parallel prefix sum.  ... 
doi:10.1109/inpar.2012.6339597 fatcat:pvmge5vmbfaghcmnuc6bgesrzq

Designing Efficient Parallel Prefix Sum Algorithms for GPUs

Gabriele Capannini
2011 2011 IEEE 11th International Conference on Computer and Information Technology  
This paper presents a novel and efficient method to compute one of the simplest and most useful building blocks for parallel algorithms: the parallel prefix sum operation.  ...  Besides its practical relevance, the problem achieves further interest in parallel-computation theory. We firstly describe step-by-step how parallel prefix sum is performed in parallel on GPUs.  ...  In [1], Blelloch defines prefix sum as follows: Definition 1: Let ⊕ be a binary associative operator with identity I.  ... 
doi:10.1109/cit.2011.11 dblp:conf/IEEEcit/Capannini11 fatcat:qljo54ry4fchzhaolqtu35orce

Bandwidth Efficient Summed Area Table Generation for CUDA
(Korean title, translated: An Efficient Method for Generating Summed Area Tables Using CUDA)

Sang-Won Ha, Moon-Hee Choi, Tae-Joon Jun, Jin-Woo Kim, Hye-Ran Byun, Tack-Don Han
2012 Journal of Korea Game Society  
global memory in order to exploit data parallelism.  ...  In this paper, we propose an efficient algorithm for generating the summed area table in the GPGPU environment where the input is decomposed into square sub-images with intermediate data that are propagated  ...  Harris et al. [7] described summed area table as an example of their work-efficient parallel prefix scan.  ... 
doi:10.7583/jkgs.2012.12.5.67 fatcat:63c4x4wgtvfzvobmumq2d6sfla

CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication [article]

Weifeng Liu, Brian Vinter
2015 arXiv   pre-print
For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion.  ...  We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite.  ...  Greathouse (AMD) and Mayank Daga (AMD) for sharing source code, libraries or implementation details of their SpMV algorithms with us.  ... 
arXiv:1503.05032v2 fatcat:qox6vqug3bddnihxpdt65x7uwa

Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

Daisuke Nishiura, Mikito Furuichi, Hide Sakaguchi
2015 Computer Physics Communications  
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors  ...  We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor  ...  Research of Evolutional Science and Technology (CREST) project "ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with  ... 
doi:10.1016/j.cpc.2015.04.006 fatcat:wcf2ksy7ond4jnthaqkgh2ty2u

Scalable GPU graph traversal

Duane Merrill, Michael Garland, Andrew Grimshaw
2012 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12  
We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity.  ...  Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.  ...  Prefix sum connotes a prefix scan with the addition operator.  ... 
doi:10.1145/2145816.2145832 dblp:conf/ppopp/MerrillGG12 fatcat:dn7judc27nawnpf7iwjpmh3vqa