A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL.
Processing of Multidimensional Range Query Using SIMD Instructions
[chapter]
2011
Communications in Computer and Information Science
The data for one SIMD load instruction must be located in consecutive memory locations and cannot be scattered over the entire memory. ...
Furthermore, our adapted prefix B-Tree enables high search performance even for larger data types. ...
This paradigm works over a sorted list of keys A by halving the search space in each iteration. Thus, the algorithm first identifies the median key of the sorted list. ...
doi:10.1007/978-3-642-25483-3_18
fatcat:a3wh37cbp5ckfooj5pbg3lomwm
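The bisection paradigm described in this abstract can be sketched in a few lines. This scalar version (an illustration, not the paper's SIMD-accelerated variant) shows the median-probing loop over a sorted key list:

```c
#include <stddef.h>

/* Classic binary search over a sorted array A of n keys.
   Each iteration probes the median of the remaining range,
   halving the search space, as the abstract describes.
   Returns the index of key, or -1 if absent. */
static long binary_search(const int *A, size_t n, int key)
{
    size_t lo = 0, hi = n;                /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;  /* median of current range */
        if (A[mid] == key)
            return (long)mid;
        else if (A[mid] < key)
            lo = mid + 1;                 /* discard lower half */
        else
            hi = mid;                     /* discard upper half */
    }
    return -1;                            /* key not present */
}
```

The SIMD variants discussed in the paper replace the single median probe with several probes per iteration; the control flow above stays the same.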
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms
2007
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07
The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11] . ...
Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery -vector registers and instructions to manipulate data stored in such registers. ...
...'s approach of using an optimization algorithm to improve the data permutations is more general than our specific iterative-deepening search [15]. ...
doi:10.1145/1248377.1248436
dblp:conf/spaa/FurtakAN07
fatcat:5rcjmu67xnfkhfxi3fy7ab24cm
Debunking the 100X GPU vs. CPU myth
2010
SIGARCH Computer Architecture News
Recent advances in computing have led to an explosion in the amount of data being generated. ...
In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. ...
We observe that cache blocking improves the performance of Sort and Search by 3-5X. Third, we found that reordering data to prevent irregular memory accesses is critical for SIMD utilization on CPUs. ...
doi:10.1145/1816038.1816021
fatcat:pxizpaiizrdq7gmsfs45obdqwy
Debunking the 100X GPU vs. CPU myth
2010
Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10
Recent advances in computing have led to an explosion in the amount of data being generated. ...
In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. ...
We observe that cache blocking improves the performance of Sort and Search by 3-5X. Third, we found that reordering data to prevent irregular memory accesses is critical for SIMD utilization on CPUs. ...
doi:10.1145/1815961.1816021
dblp:conf/isca/LeeKCDKNSSCHSD10
fatcat:7dgqdsykarcwhp22t7oxgawwza
A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops
2011
IEEE Software
We then use the compiler to exploit SIMD parallelism within each subproblem. Finally, we use autotuning to pick the best parameter values throughout the optimization process. ...
In the era of multicores, many applications that tend to require substantial compute power and data crunching (aka Throughput Computing Applications) can now be run on desktop PCs. ...
It results in 2.6x speedup over the serial case. Finally, our approach (the fourth bar) achieves 3.8x speedup over serial without changing the data layout at all. ...
doi:10.1109/ms.2011.2
fatcat:3ysms4aeebarpfhdgbzprloyxi
ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-Based Many-Core Processors
2015
Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
However, the continued growth in the width of registers and the evolving library of intrinsics make such manual optimizations tedious and error-prone. ...
That is, ASPaS takes any sorting network and a given instruction set architecture (ISA) as inputs and automatically generates vectorized code for that sorting network. ...
Execution on a VPU follows a "single instruction, multiple data" (SIMD) paradigm by carrying out the "lock-step" operations over packed data. ...
doi:10.1145/2751205.2751247
dblp:conf/ics/HouWF15
fatcat:gcf7vbt64jg6pauvltovsnnoiu
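A sorting network of the kind this abstract describes is a fixed, data-independent sequence of compare-exchange gates, which is what makes it amenable to lock-step SIMD min/max execution. The scalar 4-key sketch below is an illustration of the concept, not generated vectorized code:

```c
/* One compare-exchange: the basic gate of a sorting network.
   A SIMD code generator maps many such gates onto vector
   min/max instructions; here they run scalar. */
static void cmp_exchange(int *a, int *b)
{
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* A 5-gate sorting network for 4 keys.  The gate sequence is
   fixed regardless of the input data, so every "lane" executes
   the same operations in lock step. */
static void sort4_network(int v[4])
{
    cmp_exchange(&v[0], &v[1]);
    cmp_exchange(&v[2], &v[3]);
    cmp_exchange(&v[0], &v[2]);
    cmp_exchange(&v[1], &v[3]);
    cmp_exchange(&v[1], &v[2]);
}
```

Because no gate depends on the outcome of another gate's comparison for its position in the sequence, the same network sorts any permutation of four keys.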
Index Search Algorithms for Databases and Modern CPUs
[article]
2017
arXiv
pre-print
Over the years, many different indexing techniques and search algorithms have been proposed, including CSS-trees, CSB+ trees, k-ary binary search, and fast architecture sensitive tree search. ...
While the layout of index structures has been heavily optimized for the data cache of modern CPUs, the instruction cache has been neglected so far. ...
In addition to optimizing data layout for SIMD register size and cache line size, FAST also optimizes data layout for page size. ...
arXiv:1706.06697v1
fatcat:mlxplwmmpbgk3dsohcrhhpc2ki
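k-ary search, one of the techniques this survey covers, generalizes binary search by probing k-1 equally spaced separators per iteration, shrinking the range to 1/k of its size; with SIMD, those k-1 comparisons happen in a single vector instruction. A scalar sketch (k = 5 is an arbitrary choice for illustration):

```c
#include <stddef.h>

#define K 5  /* illustrative fan-out; SIMD would compare K-1 keys at once */

/* Scalar sketch of k-ary search over a sorted array A of n keys.
   Returns the index of key, or -1 if absent. */
static long kary_search(const int *A, size_t n, int key)
{
    size_t lo = 0, hi = n;                  /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t seg = (hi - lo) / K;         /* separator spacing */
        if (seg == 0) {                     /* small range: scan linearly */
            for (size_t i = lo; i < hi; i++)
                if (A[i] == key) return (long)i;
            return -1;
        }
        /* probe the K-1 separators; SIMD does these in one compare */
        size_t new_lo = lo, new_hi = hi;
        for (int j = 1; j < K; j++) {
            size_t p = lo + (size_t)j * seg;
            if (A[p] == key) return (long)p;
            if (A[p] < key)  new_lo = p + 1;    /* key lies right of p */
            else { new_hi = p; break; }         /* key lies left of p  */
        }
        lo = new_lo; hi = new_hi;
    }
    return -1;
}
```

Each iteration eliminates (k-1)/k of the remaining keys instead of half, trading one wide comparison for fewer iterations.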
FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs
2010
Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data - SIGMOD '10
FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware. ...
However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. ...
Thus optimized CPU search is much better than optimized GPU search in terms of architecture efficiency. ...
doi:10.1145/1807167.1807206
dblp:conf/sigmod/KimCSSNKLBD10
fatcat:cpc26e36xnft3owjv7npmn3z2e
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
2011
ACM Transactions on Database Systems
FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and Single Instruction Multiple Data (SIMD) width of the underlying hardware. ...
However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal. ...
Using the GPU's memory gather capability, P-ary accelerates search on sorted lists. ...
doi:10.1145/2043652.2043655
fatcat:aznq3gvf45g75goaxjno2bnj5u
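FAST's hierarchical blocking can be hinted at with the simplest layout-sensitive variant: an implicit breadth-first array, where node i's children sit at indices 2i+1 and 2i+2, so a whole run of levels is contiguous in memory and traversal is index arithmetic rather than pointer chasing. This sketch is a deliberate simplification, not FAST's actual page/cache-line/SIMD blocking:

```c
#include <stddef.h>

/* Fill an implicit breadth-first tree from a sorted array by
   in-order traversal (a perfect tree of n = 2^h - 1 keys assumed). */
static void build_bfs(const int *sorted, int *tree,
                      size_t i, size_t n, size_t *next)
{
    if (i >= n) return;
    build_bfs(sorted, tree, 2 * i + 1, n, next);  /* left subtree  */
    tree[i] = sorted[(*next)++];                  /* in-order fill */
    build_bfs(sorted, tree, 2 * i + 2, n, next);  /* right subtree */
}

/* Branch down the implicit tree; returns 1 if key is present.
   Every step is an index computation on a contiguous array. */
static int bfs_search(const int *tree, size_t n, int key)
{
    size_t i = 0;
    while (i < n) {
        if (tree[i] == key) return 1;
        i = (key < tree[i]) ? 2 * i + 1 : 2 * i + 2;
    }
    return 0;
}
```

FAST goes further by nesting this idea at three granularities, grouping subtrees so that the levels touched together fit one SIMD load, one cache line, and one page respectively.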
Understanding and analysis of B+ trees on NVM towards consistency and efficiency
2020
CCF Transactions on High Performance Computing
For example, we analyze the software layer optimizations and hardware layer optimizations separately and find that software layer optimizations do not always improve performance. ...
We discover that the performance of B+ trees is greatly affected by data formats. ...
We have to consider other search optimization schemes that accelerate search without incurring extra NVM writes. wB+-Tree (Chen and Jin 2015) proposes additional slot arrays to achieve binary search ...
doi:10.1007/s42514-020-00022-z
fatcat:65yz26lihrdfnixtxar4urqrfy
Fast Multi-Column Sorting in Main-Memory Column-Stores
2016
Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16
In this paper, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data-level parallelism. ...
doi:10.1145/2882903.2915205
dblp:conf/sigmod/XuFL16
fatcat:h2ritugj2fhl3gkr4w7j4wee7y
A flexible algorithm for calculating pair interactions on SIMD architectures
2013
Computer Physics Communications
In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. ...
Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. ...
Acknowledgments The authors thank Erik Lindahl for providing the analytical approximation of the Ewald correction force and for his advice on x86 SIMD optimization, NVIDIA for advice on CUDA optimization ...
doi:10.1016/j.cpc.2013.06.003
fatcat:epenk5vapramtk5zj3h4pbwqsi
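The cluster-pair idea this abstract describes, computing all interactions between the particles of two small clusters so that each particle's data is loaded once and reused, can be sketched as follows; the cluster size of 4 and the simple pair term are placeholders for illustration, not the paper's actual kernel:

```c
#include <stddef.h>

#define CLUSTER 4  /* illustrative cluster size, typically the SIMD width */

/* Scalar sketch of a cluster-pair kernel: all CLUSTER x CLUSTER
   pairs between two clusters are evaluated, so each coordinate is
   loaded once and reused CLUSTER times.  The pair term below is a
   placeholder, not an actual interaction potential. */
static double cluster_pair_energy(const double xi[CLUSTER],
                                  const double xj[CLUSTER])
{
    double e = 0.0;
    for (size_t i = 0; i < CLUSTER; i++)        /* outer: cluster i */
        for (size_t j = 0; j < CLUSTER; j++) {  /* inner: lock-step over j */
            double dx = xi[i] - xj[j];
            e += 1.0 / (dx * dx + 1.0);         /* placeholder pair term */
        }
    return e;
}
```

In a SIMD implementation the inner loop becomes one vector operation per i, which is exactly the data-reuse advantage over per-pair neighbor lists that the abstract claims.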
Automatic vectorization of tree traversals
2013
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques
Repeated tree traversals are ubiquitous in many domains such as scientific simulation, data mining and graphics. ...
For five irregular tree traversal algorithms, our techniques are able to deliver speedups of 2.78 on average over baseline implementations. ...
A larger block size results in better SIMD utilization as there are more points to search through to create full SIMD packets. ...
doi:10.1109/pact.2013.6618832
dblp:conf/IEEEpact/JoGK13
fatcat:7ek5mqgqk5dixlt4wuczu36ocq
Sort vs. Hash revisited
2009
Proceedings of the VLDB Endowment
Moreover, the performance of our hash join implementation is consistent over a wide range of input data sizes from 64K to 128M tuples and is not affected by data skew. ...
We compare this implementation to our highly optimized sort-based implementation that achieves 47M to 80M tuples per second. ...
Optimizing for DLP: Single-Instruction-Multiple-Data (SIMD) execution is an effective way to increase compute density by performing the same operation on multiple data simultaneously. ...
doi:10.14778/1687553.1687564
fatcat:qijrtwz2pfbj3hswnfmo3f2sri
A vectorization approach for multifaceted solids in VecGeom
2019
EPJ Web of Conferences
In single particle mode, VecGeom can still issue SIMD instructions by vectorizing the geometry algorithms featuring loops over internal data structures. ...
The implementations of these algorithms are templated on the input data type and are vectorised based on the VecCore [2] abstraction library in case of multiple inputs in a SIMD vector. ...
The throughput can be increased by operating on multiple input data with SIMD operations (left) or by executing faster single queries vectorizing on internal loops over faces (right). ...
doi:10.1051/epjconf/201921402025
fatcat:33hh2fdbhncftmzhllimbhdj7q
Showing results 1 — 15 out of 3,195 results