Parallel Prefix Sum with SIMD
2020
International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures
In this paper, we study different methods of computing prefix sums with SIMD instructions and multiple threads. ...
With multithreading, the memory bandwidth can become the bottleneck of prefix sum computations. ...
Listing 1: In-register prefix sums with Horizontal SIMD. ... synchronization points are needed for this multi-iteration process. ...
dblp:conf/adms/ZhangWR20
fatcat:sdhg7j2a5zcrvnwipezmtzagiu
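The in-register, horizontal-SIMD pattern named in the listing caption above can be sketched as follows; this is a minimal SSE2 example in C over four 32-bit lanes (an illustrative assumption, not the authors' code): two shift-and-add steps inside one register produce the inclusive prefix sum of the four lanes.

#include <immintrin.h>
#include <stdio.h>

/* In-register (horizontal) prefix sum of four 32-bit integers:
 * after log2(4) = 2 shift-and-add steps, lane i holds a0 + ... + ai. */
static __m128i prefix_sum_4x32(__m128i x)
{
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));   /* shift by one lane (4 bytes), add */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));   /* shift by two lanes (8 bytes), add */
    return x;
}

int main(void)
{
    int out[4];
    __m128i s = prefix_sum_4x32(_mm_setr_epi32(1, 2, 3, 4));
    _mm_storeu_si128((__m128i *)out, s);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* prints: 1 3 6 10 */
    return 0;
}

Wider vectors follow the same pattern with more steps; carrying the running total across chunks or threads is what makes the multi-iteration process mentioned in the snippet need synchronization points.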
Stream compaction for deferred shading
2009
Proceedings of the 1st ACM conference on High Performance Graphics - HPG '09
In all but simply shaded scenes, we show that the expense of sorting shaders pays off with better overall shading performance. ...
Figure 1: Efficient execution of multiple shaders poses a challenge for data parallel ray tracing and other deferred shading algorithms. ...
Acknowledgments This work was funded by the Universal Parallel Computing Research Center at the University of Illinois at Urbana-Champaign. ...
doi:10.1145/1572769.1572797
fatcat:qexkoons5jgfxllftxpy7s3xyq
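Stream compaction is a standard consumer of the prefix sum: an exclusive scan over keep/discard flags assigns each surviving element a unique output slot, after which the scatter is fully data-parallel. A minimal scalar sketch of that scan-then-scatter pattern (illustrative only, not the paper's GPU shader-sorting implementation):

#include <stddef.h>

/* Scan-then-scatter compaction: offsets[] is the exclusive prefix sum of the
 * 0/1 keep flags, so each kept element knows its destination independently. */
static size_t compact(const int *in, const unsigned char *keep, size_t n,
                      size_t *offsets, int *out)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {       /* phase 1: exclusive prefix sum of flags */
        offsets[i] = total;
        total += keep[i] ? 1 : 0;
    }
    for (size_t i = 0; i < n; ++i)         /* phase 2: independent scatter of survivors */
        if (keep[i])
            out[offsets[i]] = in[i];
    return total;                          /* number of elements kept */
}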
Rank/Select Queries over Mutable Bitmaps
[article]
2021
arXiv
pre-print
By adapting and properly extending some results concerning prefix-sum data structures, we present a practical solution to the problem, tailored for modern CPU instruction sets. ...
Compared to the state-of-the-art, our solution improves runtime with no space degradation. ...
the prefix-sums in parallel and finds the target j-th word using SIMD AVX-512 instructions. ...
arXiv:2009.12809v2
fatcat:uoyva5jqwbhsppobn3mmrgrfsi
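The idea in the snippet, keeping prefix sums of set-bit counts and then locating the target word, can be illustrated with a scalar rank query; the block size, field names, and scalar loop below are hypothetical stand-ins for the paper's layout, which searches the counters with AVX-512 instead.

#include <stdint.h>
#include <stddef.h>

/* rank1(pos) = number of 1 bits strictly before position pos (pos assumed to
 * be within the bitmap): one block-level prefix-sum lookup plus popcounts
 * over at most 8 words. */
struct rank_bitmap {
    const uint64_t *words;        /* the bitmap, 64 bits per word */
    const uint64_t *block_sums;   /* prefix sum of set bits before each 512-bit block */
};

static uint64_t rank1(const struct rank_bitmap *b, size_t pos)
{
    size_t block = pos / 512;
    size_t word  = pos / 64;
    uint64_t r = b->block_sums[block];                  /* bits before this block */
    for (size_t w = block * 8; w < word; ++w)           /* full words inside the block */
        r += (uint64_t)__builtin_popcountll(b->words[w]);
    uint64_t mask = (pos % 64) ? ((1ULL << (pos % 64)) - 1) : 0;
    return r + (uint64_t)__builtin_popcountll(b->words[word] & mask);
}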
Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions
[article]
2021
arXiv
pre-print
In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache. ...
Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded in future CPUs that feature reconfigurable regions ...
A similar discussion can be made for prefix sum [48], though parallel prefix sum uses more comparisons than the serial case, hence the smaller speedups. ...
arXiv:2106.07456v1
fatcat:kovjluczwnavpnrnmohlwsftlq
Fast integer compression using SIMD instructions
2010
Proceedings of the Sixth International Workshop on Data Management on New Hardware - DaMoN '10
More specifically, we provide SIMD versions of both null suppression and Elias gamma encoding. ...
In contrast to traditional integer compression, our algorithms make use of the SIMD capabilities of modern processors by encoding multiple integer values at once. ...
Today, our parallel versions are on par with uncompressed processing in many cases. ...
doi:10.1145/1869389.1869394
dblp:conf/damon/SchlegelGL10
fatcat:yiqxth2o5jg5zadcqckzeuvlcm
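For context, here is a scalar sketch of Elias gamma encoding, the sequential baseline whose values the paper's SIMD version encodes several at a time: a positive integer x with N = floor(log2 x) is written as N zero bits followed by the (N+1)-bit binary form of x. Bits are emitted as '0'/'1' characters purely for readability; this is illustrative, not the authors' bit-packing code.

#include <stdio.h>
#include <stddef.h>

/* Elias gamma code of a positive integer x:
 * floor(log2 x) zeros, then x written in floor(log2 x) + 1 binary digits. */
static size_t elias_gamma(unsigned x, char *out)
{
    unsigned n = 0;
    while ((x >> (n + 1)) != 0)          /* n = floor(log2 x), x >= 1 assumed */
        ++n;
    size_t len = 0;
    for (unsigned i = 0; i < n; ++i)
        out[len++] = '0';                /* n leading zeros encode the length */
    for (int i = (int)n; i >= 0; --i)
        out[len++] = ((x >> i) & 1u) ? '1' : '0';
    out[len] = '\0';
    return len;
}

int main(void)
{
    char buf[64];
    elias_gamma(9, buf);
    printf("%s\n", buf);                 /* 9 = 0b1001 -> "0001001" */
    return 0;
}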
One Dimensional SIMD Array Processor with Segmentable Bus
2011
Procedia Engineering
Additionally, the segmentable bus provides high flexibility for different demands, so that PEs can cooperate with each other more efficiently. ...
By analyzing the application requirements and parallel computer architectures, an embedded data-parallel computer architecture model is proposed for multimedia processing applications. ...
To save space, this paper gives only one example of SIMD computing for the proposed model: a common algorithm, the prefix sum, used to calculate the histogram, which is a typical point operation ...
doi:10.1016/j.proeng.2011.08.694
fatcat:afkex63hy5hsvmo3aoxgwfdk3u
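The histogram example mentioned in the last snippet reduces to two passes, per-bin counting followed by a prefix sum over the bins; a minimal sequential sketch (the paper maps this onto its SIMD array of PEs, which is not reproduced here):

#include <stddef.h>

#define BINS 256

/* Cumulative histogram of 8-bit pixels: hist[] holds per-bin counts and
 * cdf[] its inclusive prefix sum, the basis of point operations such as
 * histogram equalization. */
static void cumulative_histogram(const unsigned char *pixels, size_t n,
                                 unsigned hist[BINS], unsigned cdf[BINS])
{
    for (int b = 0; b < BINS; ++b)
        hist[b] = 0;
    for (size_t i = 0; i < n; ++i)        /* pass 1: count pixels per bin */
        hist[pixels[i]]++;
    unsigned running = 0;
    for (int b = 0; b < BINS; ++b) {      /* pass 2: inclusive prefix sum over bins */
        running += hist[b];
        cdf[b] = running;
    }
}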
Bitpacking techniques for indexing genomes: I. Hash tables
2016
Algorithms for Molecular Biology
It also has potential applications to other domains requiring differential coding with random access. ...
Conclusions: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. ...
prefix sum for the next block. ...
doi:10.1186/s13015-016-0069-5
pmid:27095998
pmcid:PMC4835851
fatcat:yhuslfqwn5d63ksjoefqg7todq
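The "prefix sum for the next block" phrase above points at the usual way differential coding keeps random access: store deltas in fixed-size blocks and keep the absolute prefix sum at each block boundary, so a lookup decodes at most one block. The structure below is a hypothetical illustration of that pattern, not the paper's BP64-columnar layout.

#include <stdint.h>
#include <stddef.h>

#define BLOCK 64

/* Differentially coded sequence with block anchors:
 * anchors[b] is the prefix sum of all deltas before block b, so value_at(i)
 * only scans the deltas inside block i / BLOCK. */
struct delta_array {
    const uint32_t *deltas;    /* deltas[0] = value[0]; deltas[i] = value[i] - value[i-1] */
    const uint64_t *anchors;   /* anchors[b] = sum of deltas before block b */
};

static uint64_t value_at(const struct delta_array *a, size_t i)
{
    size_t block = i / BLOCK;
    uint64_t v = a->anchors[block];
    for (size_t j = block * BLOCK; j <= i; ++j)   /* prefix sum within one block */
        v += a->deltas[j];
    return v;
}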
SIMD compression and the intersection of sorted integers
2015
Software, Practice & Experience
We experiment with two TREC text collections, GOV2 and ClueWeb09 (Category B), using logs from the TREC million-query track. ...
We exploit the fact that one SIMD instruction can compare 4 pairs of integers at once. ...
Thankfully we can accelerate the computation of the prefix sum using SIMD instructions. Our first contribution is to revisit the computation of the prefix sum. ...
doi:10.1002/spe.2326
fatcat:2wgknj4ysjeermg3pzitmrmtfi
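The "4 pairs of integers at once" observation can be shown in isolation: one SSE comparison tests a probe value against four candidates, and a movemask reports whether any lane matched. This is a minimal sketch of that single step, not the paper's full intersection algorithm.

#include <immintrin.h>

/* One vector comparison performs four equality tests at once. */
static int probe_in_block4(int probe, const int *block /* 4 ints */)
{
    __m128i p  = _mm_set1_epi32(probe);                    /* broadcast the probe */
    __m128i b  = _mm_loadu_si128((const __m128i *)block);  /* load 4 candidates */
    __m128i eq = _mm_cmpeq_epi32(p, b);                    /* 4 comparisons in one instruction */
    return _mm_movemask_epi8(eq) != 0;                     /* did any lane match? */
}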
Block aligner: fast and flexible pairwise sequence alignment with SIMD-accelerated adaptive blocks
[article]
2021
bioRxiv
pre-print
Since differences between cells are small, this allows for maximum parallelism with SIMD vectors. ...
In this paper, we assume that the CPU supports SIMD vectors with L 16-bit lanes that can be operated on with some basic operation in parallel. ...
Abbreviations AVX: Advanced Vector Extensions DP: Dynamic Programming SIMD: Single Instruction Multiple Data
Competing interests The authors declare that they have no competing interests. ...
doi:10.1101/2021.11.08.467651
fatcat:tslwfps625ggjiy6uchy3l5qi4
Policy-based tuning for performance portability and library co-optimization
2012
2012 Innovative Parallel Computing (InPar)
In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor ...
From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting ...
We also apply the same thread-serialization techniques for constructing local implementations of parallel prefix sum. ...
doi:10.1109/inpar.2012.6339597
fatcat:pvmge5vmbfaghcmnuc6bgesrzq
Designing Efficient Parallel Prefix Sum Algorithms for GPUs
2011
2011 IEEE 11th International Conference on Computer and Information Technology
This paper presents a novel and efficient method to compute one of the simplest and most useful building blocks for parallel algorithms: the parallel prefix sum operation. ...
Besides its practical relevance, the problem is of further interest in parallel-computation theory. We first describe step-by-step how parallel prefix sum is performed in parallel on GPUs. ...
In [1], Blelloch defines prefix sum as follows: Definition 1: Let ⊕ be a binary associative operator with identity I. ...
doi:10.1109/cit.2011.11
dblp:conf/IEEEcit/Capannini11
fatcat:qljo54ry4fchzhaolqtu35orce
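Definition 1 quoted in the snippet can be completed from the standard Blelloch formulation (reconstructed here, so the exact wording may differ from the paper): given a binary associative operator $\oplus$ with identity $I$ and an input sequence $[a_0, a_1, \ldots, a_{n-1}]$, the exclusive prefix sum (prescan) is

\[ [\,I,\; a_0,\; a_0 \oplus a_1,\; \ldots,\; a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2}\,], \]

and the inclusive form is

\[ [\,a_0,\; a_0 \oplus a_1,\; \ldots,\; a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}\,]. \]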
Bandwidth Efficient Summed Area Table Generation for CUDA
An Efficient Method for Generating Summed Area Tables Using CUDA
2012
Journal of Korea Game Society
global memory in order to exploit data parallelism. ...
In this paper, we propose an efficient algorithm for generating the summed area table in the GPGPU environment where the input is decomposed into square sub-images with intermediate data that are propagated ...
Harris et al. [7] described the summed area table as an example of their work-efficient parallel prefix scan. ...
doi:10.7583/jkgs.2012.12.5.67
fatcat:63c4x4wgtvfzvobmumq2d6sfla
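The summed area table itself is just a two-dimensional prefix sum, one scan along rows followed by one scan down columns; the sequential sketch below shows that structure only (the paper's contribution, decomposing the image into sub-images and propagating intermediate data through global memory, is not reproduced).

#include <stddef.h>

/* sat[y*w + x] = sum of img over all pixels (x', y') with x' <= x and y' <= y.
 * Row-major buffers of size w * h. */
static void summed_area_table(const float *img, float *sat, size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y) {        /* pass 1: prefix sum along each row */
        float run = 0.0f;
        for (size_t x = 0; x < w; ++x) {
            run += img[y * w + x];
            sat[y * w + x] = run;
        }
    }
    for (size_t x = 0; x < w; ++x) {        /* pass 2: prefix sum down each column */
        float run = 0.0f;
        for (size_t y = 0; y < h; ++y) {
            run += sat[y * w + x];
            sat[y * w + x] = run;
        }
    }
}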
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
[article]
2015
arXiv
pre-print
For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low overhead for format conversion. ...
We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. ...
Greathouse (AMD) and Mayank Daga (AMD) for sharing source code, libraries or implementation details of their SpMV algorithms with us. ...
arXiv:1503.05032v2
fatcat:qox6vqug3bddnihxpdt65x7uwa
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing
2015
Computer Physics Communications
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors ...
We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor ...
Research of Evolutional Science and Technology (CREST) project "ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with ...
doi:10.1016/j.cpc.2015.04.006
fatcat:wcf2ksy7ond4jnthaqkgh2ty2u
Scalable GPU graph traversal
2012
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12
We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(|V|+|E|) work complexity. ...
Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. ...
Prefix sum connotes a prefix scan with the addition operator. ...
doi:10.1145/2145816.2145832
dblp:conf/ppopp/MerrillGG12
fatcat:dn7judc27nawnpf7iwjpmh3vqa
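The role the prefix sum plays in this kind of BFS can be sketched in one function: an exclusive scan over the out-degrees of the current frontier gives every vertex a private, contiguous range in the next frontier, so neighbor gathering needs no atomics. The scalar version below is illustrative only; the paper runs the scan cooperatively on the GPU.

#include <stddef.h>

/* offset[i] = exclusive prefix sum of degree[0..i-1]; the return value is the
 * total size of the next frontier buffer. Vertex i writes its neighbors to
 * out[offset[i] .. offset[i] + degree[i] - 1]. */
static size_t frontier_offsets(const size_t *degree, size_t n, size_t *offset)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        offset[i] = total;
        total += degree[i];
    }
    return total;
}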