
Processing of Multidimensional Range Query Using SIMD Instructions [chapter]

Peter Chovanec, Michal Krátký
2011 Communications in Computer and Information Science  
The data for one SIMD load instruction must be located in consecutive memory locations and cannot be scattered over the entire memory.  ...  Furthermore, our adapted prefix B-Tree enables a high search performance even for larger data types.  ...  This paradigm works iteratively over a sorted list of keys A by dividing the search space equally in each iteration. Thus, the algorithm first identifies the median key of a sorted list of keys.  ... 
doi:10.1007/978-3-642-25483-3_18 fatcat:a3wh37cbp5ckfooj5pbg3lomwm
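The snippet above describes classic binary search: repeatedly pick the median key of a sorted list and halve the search space. A minimal Python sketch of that iteration (function name is illustrative):

```python
def binary_search(keys, target):
    """Return the index of target in the sorted list `keys`, or -1 if absent."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2   # median key of the current search space
        if keys[mid] == target:
            return mid
        elif keys[mid] < target:
            lo = mid + 1       # discard the lower half
        else:
            hi = mid - 1       # discard the upper half
    return -1
```

Each iteration halves the candidate range, giving the O(log n) behavior the prefix B-Tree variant in the paper builds on.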

Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Timothy Furtak, José Nelson Amaral, Robert Niewiadomski
2007 Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07  
The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11] .  ...  Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery -vector registers and instructions to manipulate data stored in such registers.  ...  s approach of using an optimization algorithm to improve the data permutations is more general than our specific iterative-deepening search [15] .  ... 
doi:10.1145/1248377.1248436 dblp:conf/spaa/FurtakAN07 fatcat:5rcjmu67xnfkhfxi3fy7ab24cm
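The SIMD machinery this entry describes is typically applied to sorting as lane-wise compare-exchange steps: one vector min and one vector max order many element pairs at once (on x86, e.g. `_mm_min_epi32` / `_mm_max_epi32`). A Python sketch that simulates the lanes with lists:

```python
def compare_exchange(a, b):
    """Lane-wise compare-exchange of two simulated vector registers.
    On real hardware this is one SIMD min plus one SIMD max instruction,
    ordering len(a) element pairs without any per-element branches."""
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    return lo, hi
```

Chaining such steps in a fixed pattern yields the vectorized sorting networks the paper optimizes.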

Debunking the 100X GPU vs. CPU myth

Victor W. Lee, Per Hammarlund, Ronak Singhal, Pradeep Dubey, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty
2010 SIGARCH Computer Architecture News  
Recent advances in computing have led to an explosion in the amount of data being generated.  ...  In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels.  ...  We observe that cache blocking improves the performance of Sort and Search by 3-5X. Third, we found that reordering data to prevent irregular memory accesses is critical for SIMD utilization on CPUs.  ... 
doi:10.1145/1816038.1816021 fatcat:pxizpaiizrdq7gmsfs45obdqwy
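The data-reordering point in the snippet is concrete: an array-of-structures layout scatters same-typed fields and forces gathers, while a structure-of-arrays layout makes each field contiguous so one SIMD load fills a full vector. A minimal sketch of the transform (function name is illustrative):

```python
def aos_to_soa(points):
    """Convert array-of-structures [(x, y, z), ...] into structure-of-arrays
    ([x...], [y...], [z...]) so each component is contiguous in memory and
    SIMD loads need no gather operations."""
    xs, ys, zs = zip(*points) if points else ((), (), ())
    return list(xs), list(ys), list(zs)
```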

Debunking the 100X GPU vs. CPU myth

Victor W. Lee, Per Hammarlund, Ronak Singhal, Pradeep Dubey, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty
2010 Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10  
doi:10.1145/1815961.1816021 dblp:conf/isca/LeeKCDKNSSCHSD10 fatcat:7dgqdsykarcwhp22t7oxgawwza

A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops

Chi-Keung Luk, Ryan Newton, William Hasenplaugh, Mark Hampton, Geoff Lowney
2011 IEEE Software  
We then use the compiler to exploit SIMD parallelism within each subproblem. Finally, we use autotuning to pick the best parameter values throughout the optimization process.  ...  In the era of multicores, many applications that tend to require substantial compute power and data crunching (aka Throughput Computing Applications) can now be run on desktop PCs.  ...  It results in 2.6x speedup over the serial case. Finally, our approach (the fourth bar) achieves 3.8x speedup over serial without changing the data layout at all.  ... 
doi:10.1109/ms.2011.2 fatcat:3ysms4aeebarpfhdgbzprloyxi

ASPaS

Kaixi Hou, Hao Wang, Wu-chun Feng
2015 Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15  
However, the continued growth in the width of registers and the evolving library of intrinsics make such manual optimizations tedious and error-prone.  ...  That is, ASPaS takes any sorting network and a given instruction set architecture (ISA) as inputs and automatically generates vectorized code for that sorting network.  ...  Execution on a VPU follows a "single instruction, multiple data" (SIMD) paradigm by carrying out the "lock-step" operations over packed data.  ... 
doi:10.1145/2751205.2751247 dblp:conf/ics/HouWF15 fatcat:gcf7vbt64jg6pauvltovsnnoiu
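ASPaS takes a sorting network plus an ISA and emits vectorized code; what makes a sorting network SIMD-friendly is that its comparator sequence is fixed and data-independent. A scalar Python sketch of a standard optimal 4-input network (a generator like ASPaS would map each comparator stage to SIMD min/max and shuffle instructions):

```python
SORT4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]  # optimal 4-input network

def sort4(v):
    """Sort a 4-element sequence with a fixed, data-independent comparator
    sequence; the same comparisons run regardless of the input values."""
    v = list(v)
    for i, j in SORT4:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v
```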

Index Search Algorithms for Databases and Modern CPUs [article]

Florian Gross
2017 arXiv pre-print
Over the years, many different indexing techniques and search algorithms have been proposed, including CSS-trees, CSB+ trees, k-ary binary search, and fast architecture sensitive tree search.  ...  While the layout of index structures has been heavily optimized for the data cache of modern CPUs, the instruction cache has been neglected so far.  ...  In addition to optimizing data layout for SIMD register size and cache line size, FAST also optimizes data layout for page size.  ... 
arXiv:1706.06697v1 fatcat:mlxplwmmpbgk3dsohcrhhpc2ki
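One technique this survey covers, k-ary search, generalizes binary search: each step compares the target against k-1 evenly spaced separators (one SIMD compare plus a population count in a vectorized implementation) and keeps one of k partitions. A scalar Python sketch of the idea:

```python
def kary_search(keys, target, k=5):
    """k-ary search on a sorted list of distinct keys: each step narrows the
    range [lo, hi) to one of k partitions delimited by k-1 separators."""
    lo, hi = 0, len(keys)
    while hi - lo > 1:
        # separator positions splitting [lo, hi) into k partitions
        seps = [lo + (hi - lo) * i // k for i in range(1, k)]
        # count separators whose key is <= target
        # (one SIMD compare + popcount when vectorized)
        n = sum(1 for s in seps if keys[s] <= target)
        lo = lo if n == 0 else seps[n - 1]
        hi = hi if n == k - 1 else seps[n]
    return lo if lo < len(keys) and keys[lo] == target else -1
```

With k matched to the SIMD width, all separator comparisons of a step cost roughly one vector instruction.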

FAST

Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey
2010 Proceedings of the 2010 international conference on Management of data - SIGMOD '10  
FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware.  ...  However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal.  ...  Thus optimized CPU search is much better than optimized GPU search in terms of architecture efficiency.  ... 
doi:10.1145/1807167.1807206 dblp:conf/sigmod/KimCSSNKLBD10 fatcat:cpc26e36xnft3owjv7npmn3z2e
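The "logically organized" layout FAST describes starts from an implicit, breadth-first array encoding of a binary search tree: node i's children live at 2i+1 and 2i+2, so descent is pure index arithmetic with no pointers. A Python sketch of that base layout (FAST goes further and groups subtrees into SIMD-, cache-line-, and page-sized blocks, which this sketch omits):

```python
def eytzinger(sorted_keys):
    """Lay a sorted array out in breadth-first (implicit binary tree) order."""
    out = [None] * len(sorted_keys)
    it = iter(sorted_keys)
    def fill(i):
        if i < len(out):           # in-order walk assigns sorted keys
            fill(2 * i + 1)
            out[i] = next(it)
            fill(2 * i + 2)
    fill(0)
    return out

def tree_search(tree, target):
    """Descend by index arithmetic: children of node i are at 2i+1 and 2i+2."""
    i = 0
    while i < len(tree):
        if tree[i] == target:
            return i
        i = 2 * i + 1 + (target > tree[i])
    return -1
```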

Designing fast architecture-sensitive tree search on modern multicore/many-core processors

Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey
2011 ACM Transactions on Database Systems  
FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and Single Instruction Multiple Data (SIMD) width of the underlying hardware.  ...  However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal.  ...  Using the GPU's memory gather capability, P-ary accelerates search on sorted lists.  ... 
doi:10.1145/2043652.2043655 fatcat:aznq3gvf45g75goaxjno2bnj5u

Understanding and analysis of B+ trees on NVM towards consistency and efficiency

Jiangkun Hu, Youmin Chen, Youyou Lu, Xubin He, Jiwu Shu
2020 CCF Transactions on High Performance Computing  
For example, we analyze the software layer optimizations and hardware layer optimizations separately and find that software layer optimizations do not always improve performance.  ...  We discover that the performance of B+ trees is greatly affected by data formats.  ...  We have to consider other search-optimization schemes which can accelerate the search but bring no extra NVM writes. wB+ tree (Chen and Jin 2015) proposes additional slot arrays to achieve binary search  ... 
doi:10.1007/s42514-020-00022-z fatcat:65yz26lihrdfnixtxar4urqrfy
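The slot-array idea the snippet attributes to wB+-tree can be sketched briefly: keep leaf entries append-only (so each insert is one sequential NVM write) and maintain a small sorted array of entry indices so lookups can still binary-search. A minimal Python sketch (class and method names are illustrative, not the paper's API):

```python
import bisect

class SlotLeaf:
    """Append-only leaf entries plus a sorted slot array of indices,
    so lookups binary-search without sorting the entries in place."""
    def __init__(self):
        self.entries = []   # (key, value) pairs, append-only
        self.slots = []     # entry indices, kept sorted by key

    def insert(self, key, value):
        self.entries.append((key, value))              # sequential write
        pos = bisect.bisect([self.entries[s][0] for s in self.slots], key)
        self.slots.insert(pos, len(self.entries) - 1)  # small slot update

    def lookup(self, key):
        lo, hi = 0, len(self.slots) - 1
        while lo <= hi:                                # binary search via slots
            mid = (lo + hi) // 2
            k, v = self.entries[self.slots[mid]]
            if k == key:
                return v
            lo, hi = (mid + 1, hi) if k < key else (lo, mid - 1)
        return None
```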

Fast Multi-Column Sorting in Main-Memory Column-Stores

Wenjian Xu, Ziqiang Feng, Eric Lo
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
... or by improving the degree of SIMD data level parallelism.  ...  In this paper, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and  ...  SIMD data parallelism.  ... 
doi:10.1145/2882903.2915205 dblp:conf/sigmod/XuFL16 fatcat:h2ritugj2fhl3gkr4w7j4wee7y

A flexible algorithm for calculating pair interactions on SIMD architectures

Szilárd Páll, Berk Hess
2013 Computer Physics Communications  
In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential.  ...  Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization.  ...  Acknowledgments The authors thank Erik Lindahl for providing the analytical approximation of the Ewald correction force and for his advice on x86 SIMD optimization, NVIDIA for advice on CUDA optimization  ... 
doi:10.1016/j.cpc.2013.06.003 fatcat:epenk5vapramtk5zj3h4pbwqsi

PS-cache: an energy-efficient cache design for chip multiprocessors

Youngjoon Jo, Michael Goldfarb, Milind Kulkarni
2013 Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques  
Repeated tree traversals are ubiquitous in many domains such as scientific simulation, data mining and graphics.  ...  For five irregular tree traversal algorithms, our techniques are able to deliver speedups of 2.78 on average over baseline implementations.  ...  A larger block size results in better SIMD utilization as there are more points to search through to create full SIMD packets.  ... 
doi:10.1109/pact.2013.6618832 dblp:conf/IEEEpact/JoGK13 fatcat:7ek5mqgqk5dixlt4wuczu36ocq
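The block-size observation in the snippet is easy to quantify: traversal points are grouped into SIMD-width packets, and only the last packet of a block can be partially full, so larger blocks waste fewer lanes. A small Python sketch (names are illustrative):

```python
import math

SIMD_WIDTH = 8  # illustrative lane count

def make_packets(points, width=SIMD_WIDTH):
    """Group traversal points into SIMD-width packets."""
    return [points[i:i + width] for i in range(0, len(points), width)]

def packet_utilization(n_points, width=SIMD_WIDTH):
    """Fraction of SIMD lanes doing useful work: only the final packet
    can be partially full, so larger blocks approach full utilization."""
    if n_points == 0:
        return 0.0
    packets = math.ceil(n_points / width)
    return n_points / (packets * width)
```

For example, a 12-point block at width 8 fills 12 of 16 lanes (75%), while a 64-point block fills all lanes.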

Sort vs. Hash revisited

Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen, Nadathur Satish, Jatin Chhugani, Andrea Di Blas, Pradeep Dubey
2009 Proceedings of the VLDB Endowment  
Moreover, the performance of our hash join implementation is consistent over a wide range of input data sizes from 64K to 128M tuples and is not affected by data skew.  ...  We compare this implementation to our highly optimized sort-based implementation that achieves 47M to 80M tuples per second.  ...  Optimizing for DLP Single-Instruction-Multiple-Data (SIMD) execution is an effective way to increase compute density by performing the same operation on multiple data simultaneously.  ... 
doi:10.14778/1687553.1687564 fatcat:qijrtwz2pfbj3hswnfmo3f2sri

A vectorization approach for multifaceted solids in VecGeom

John Apostolakis, Gabriele Cosmo, Andrei Gheata, Mihaela Gheata, Raman Sehgal, Sandro Wenzel, A. Forti, L. Betev, M. Litmaath, O. Smirnova, P. Hristov
2019 EPJ Web of Conferences  
In single particle mode, VecGeom can still issue SIMD instructions by vectorizing the geometry algorithms featuring loops over internal data structures.  ...  The implementations of these algorithms are templated on the input data type and are vectorised based on the VecCore [2] abstraction library in case of multiple inputs in a SIMD vector.  ...  The throughput can be increased by operating on multiple input data with SIMD operations (left) or by executing faster single queries vectorizing on internal loops over faces (right).  ... 
doi:10.1051/epjconf/201921402025 fatcat:33hh2fdbhncftmzhllimbhdj7q