FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X ...
In this paper, we present FAST, an extremely fast architecture sensitive layout of the index tree. ...
In this paper, we present FAST (Fast Architecture Sensitive Tree), a search algorithm that exploits the high compute capability of modern processors for index tree traversal. ...
doi:10.1145/1807167.1807206
dblp:conf/sigmod/KimCSSNKLBD10
fatcat:cpc26e36xnft3owjv7npmn3z2e
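The core idea behind FAST-style k-ary search, as the snippets describe it, can be illustrated with a small sketch. This is not the paper's implementation: the function name and the plain-Python loop are illustrative stand-ins for what, on real hardware, would be one SIMD comparison per tree level.

```python
def k_ary_search(sorted_keys, query, k=4):
    """Return an index of query in sorted_keys (distinct keys), or -1.

    Illustrative sketch: each step compares the key against k-1
    separators (a SIMD register's worth on real hardware), narrowing
    the search range by a factor of k instead of 2.
    """
    if not sorted_keys:
        return -1
    lo, hi = 0, len(sorted_keys)
    while hi - lo > 1:
        # Pick up to k-1 evenly spaced separators from the current range.
        step = max(1, (hi - lo) // k)
        seps = list(range(lo + step, hi, step))[: k - 1]
        # On SIMD hardware these separator comparisons are a single
        # vector instruction; here we just count how many are <= query.
        child = sum(1 for s in seps if sorted_keys[s] <= query)
        bounds = [lo] + seps + [hi]
        lo, hi = bounds[child], bounds[child + 1]
    return lo if sorted_keys[lo] == query else -1
```

With k equal to the number of keys per SIMD register or cache line, each memory access resolves multiple comparisons, which is the source of the latency hiding the abstract claims.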
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
2011
ACM Transactions on Database Systems
FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second for large trees ...
... CPUs and shared buffers on GPUs. ...
doi:10.1145/2043652.2043655
fatcat:aznq3gvf45g75goaxjno2bnj5u
A fast GPU algorithm for graph connectivity
2010
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
We also draw interesting observations on why PRAM algorithms, such as the Shiloach-Vishkin algorithm, may not be a good fit for the GPU and how they should be modified. ...
For instance, our implementation finds connected components of a graph of 10 million nodes and 60 million edges in about 500 milliseconds on a GPU, given a random edge list. ...
The Shiloach-Vishkin algorithm as proposed [24] may not be well suited to modern architectures such as the GPU. ...
doi:10.1109/ipdpsw.2010.5470817
dblp:conf/ipps/SomanKN10
fatcat:njb65qatt5hd3dmdn6v3smwupm
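For context on the algorithm the abstract discusses, here is a serial Python sketch of Shiloach-Vishkin-style connected components over an edge list. It is a stand-in, not the paper's GPU code: on a GPU, the hooking and pointer-jumping loops become data-parallel kernels over edges and vertices.

```python
def connected_components(num_nodes, edges):
    """Label connected components of an undirected edge list.

    Shiloach-Vishkin-style sketch: alternate "hooking" (attach the
    larger label's tree under the smaller label) with "pointer jumping"
    (flatten every tree to its root) until no edge spans two labels.
    """
    parent = list(range(num_nodes))
    changed = True
    while changed:
        changed = False
        # Hooking: for each cross-component edge, hook the larger
        # label under the smaller one (parallel over edges on a GPU).
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                parent[max(ru, rv)] = min(ru, rv)
                changed = True
        # Pointer jumping: flatten trees so every node points at its
        # root (parallel over vertices on a GPU).
        for v in range(num_nodes):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent
```

Each hooking pass retires at least one root, so the loop terminates in at most a linear number of passes; the GPU concern raised in the snippet is precisely the cost and contention of these repeated global passes.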
Scalable fast multipole methods on distributed heterogeneous architectures
2011
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well ...
We first develop a single-node version where the CPU part is parallelized using OpenMP and the GPU part via CUDA. ...
On modern multicore and GPU architectures, this requires parallelization of the algorithm. Parallelization of the FMM has been pursued almost since its invention; see, e.g., [5, 6, 7, 8]. ...
doi:10.1145/2063384.2063432
dblp:conf/sc/HuGD11
fatcat:dxmswho42vhp7pm5klcymnebx4
Jet: Fast quantum circuit simulations with parallel task-based tensor-network contraction
[article]
2022
arXiv
pre-print
We demonstrate the advantages of our method by benchmarking our code on several Sycamore-53 and Gaussian boson sampling (GBS) supremacy circuits against other simulators. ...
iii) the concurrent contraction of tensor networks on all available hardware. ...
The authors thank SOSCIP for their computational resources and financial support. We acknowledge the computational resources and support from SciNet. ...
arXiv:2107.09793v3
fatcat:4hoswy5yrnagdo5zs5pzystj2i
QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors
2014
PLoS ONE
We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. ...
Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs ...
Author Contributions Conceived and designed the experiments: AG SD. Performed the experiments: AG. Analyzed the data: AG SD. Contributed reagents/ materials/analysis tools: AG. ...
doi:10.1371/journal.pone.0088901
pmid:24586435
pmcid:PMC3934876
fatcat:wfhb4nqdsre43aegdai2lka5t4
Fast Automatic Heuristic Construction Using Active Learning
[chapter]
2015
Lecture Notes in Computer Science
We demonstrate this technique by automatically constructing a model to determine on which device to execute four parallel programs at differing problem dimensions for a representative CPU-GPU based heterogeneous ...
Our approach, on the other hand, uses active learning to select and only focus on the most useful training examples. ...
Our method then searches for an input for which the intermediate models or heuristics most disagree on whether it should be run on the CPU or the GPU. ...
doi:10.1007/978-3-319-17473-0_10
fatcat:gwcklt44kvcgfhgpirsh76lvne
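The disagreement-driven query selection the snippet describes can be sketched as follows. The function name and the callable-model interface are hypothetical, not from the paper; the point is only the selection criterion: query the candidate on which an ensemble of intermediate models splits most evenly over CPU vs. GPU placement.

```python
def most_disagreed(models, candidates):
    """Pick the candidate input the ensemble most disagrees about.

    Sketch of query-by-committee active learning: each model is a
    callable returning "cpu" or "gpu"; disagreement is maximal when
    the committee vote is closest to an even split.
    """
    def disagreement(x):
        votes = [m(x) for m in models]
        gpu = votes.count("gpu")
        return min(gpu, len(votes) - gpu)  # 0 = unanimous, max = even split
    return max(candidates, key=disagreement)
```

Profiling only the input returned here, rather than the whole candidate grid, is what lets the approach "focus on the most useful training examples."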
Fast parallel GPU-sorting using a hybrid algorithm
2008
Journal of Parallel and Distributed Computing
It is 6 times faster than single CPU quicksort, and 10% faster than the recent GPU-based radix sort. ...
This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achieves high speed by efficiently utilizing the parallelism of the GPU throughout the whole algorithm. ...
Meanwhile, splitting the list into too many parts would lead to longer binary searches for each bucketsort-thread and more traffic between the CPU and GPU. ...
doi:10.1016/j.jpdc.2008.05.012
fatcat:n5i3ukuba5ahda2tqwkt7z7l7a
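The bucket-splitting tradeoff mentioned in the last snippet can be made concrete with a toy serial sketch, assuming nothing about the paper's actual kernels: pivots partition the list into range buckets (the parallel placement step on the GPU), then each bucket is sorted independently, so no merge pass is needed.

```python
import bisect

def hybrid_sort(data, num_buckets=4):
    """Toy sketch of bucket-split-then-sort hybrid sorting."""
    if not data:
        return []
    # Choose bucket boundaries from a sorted sample of the input.
    sample = sorted(data[:: max(1, len(data) // 64)])
    step = max(1, len(sample) // num_buckets)
    pivots = sample[step::step][: num_buckets - 1]
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in data:
        # A binary search over the pivots places each element in its
        # bucket; this is the per-element work done in parallel on a GPU.
        buckets[bisect.bisect_right(pivots, x)].append(x)
    out = []
    for b in buckets:  # buckets are disjoint ranges: no merge required
        out.extend(sorted(b))
    return out
```

More buckets mean more parallel sort work but, as the snippet notes, longer binary searches per element and more CPU-GPU traffic for the bucket metadata.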
Fast k-NNG Construction with GPU-Based Quick Multi-Select
2014
PLoS ONE
Benchmarks show significant improvement over state-of-the-art implementations of the k-NN search on GPUs. ...
Our optimization makes clever use of warp voting functions available on the latest GPUs along with use-controlled cache. ...
For low-dimensional data sets, there are a variety of indexing data structures such as kd-trees [9], BBD-trees [10], random-projection trees (rp-trees) [11], and hashing based on locally sensitive ...
doi:10.1371/journal.pone.0092409
pmid:24809341
pmcid:PMC4014471
fatcat:mcwn2t4adjhz7bqdob2uhtl6a4
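A rough serial sketch of the quickselect-style selection behind "quick multi-select": keep the k smallest distances without fully sorting the candidate list. This is an illustrative stand-in; the paper's GPU version uses warp voting and shared memory rather than this list-slicing loop.

```python
import random

def k_smallest(dists, k):
    """Return the k smallest values (unordered) via quickselect-style
    three-way partitioning, avoiding a full sort of the distances."""
    data = list(dists)
    lo, hi = 0, len(data)
    while hi - lo > 1:
        pivot = data[random.randrange(lo, hi)]
        # Three-way partition of the active range around the pivot.
        less = [x for x in data[lo:hi] if x < pivot]
        equal = [x for x in data[lo:hi] if x == pivot]
        greater = [x for x in data[lo:hi] if x > pivot]
        data[lo:hi] = less + equal + greater
        if k < lo + len(less):
            hi = lo + len(less)            # k-th boundary is in "less"
        elif k <= lo + len(less) + len(equal):
            break                          # boundary falls inside "equal"
        else:
            lo = lo + len(less) + len(equal)
    return data[:k]
```

For k-NN, `dists` would be the candidate distances for one query point, so each query needs only a partial selection rather than an O(n log n) sort.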
Index Search Algorithms for Databases and Modern CPUs
[article]
2017
arXiv
pre-print
Over the years, many different indexing techniques and search algorithms have been proposed, including CSS-trees, CSB+ trees, k-ary binary search, and fast architecture sensitive tree search. ...
We show how to combine index compilation with previous approaches, such as binary tree search, cache-sensitive tree search, and the architecture-sensitive tree search presented by Kim et al. ...
Fast architecture sensitive tree search (FAST, [KCS+10]) unifies the optimality properties of CSS-tree search and k-ary search. ...
arXiv:1706.06697v1
fatcat:mlxplwmmpbgk3dsohcrhhpc2ki
Applications and Techniques for Fast Machine Learning in Science
[article]
2021
arXiv
pre-print
The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. ...
hardware, e.g., CPU, GPU, ASIC, and FPGA. ...
arXiv:2110.13041v1
fatcat:cvbo2hmfgfcuxi7abezypw2qrm
Applying Deep Learning to Fast Radio Burst Classification
2018
Astronomical Journal
Upcoming Fast Radio Burst (FRB) surveys will search ∼10^3 beams on sky with very high duty cycle, generating large numbers of single-pulse candidates. ...
We construct a tree-like deep neural network (DNN) that takes multiple or individual data products as input (e.g. dynamic spectra and multi-beam detection information) and trains on them simultaneously ...
We thank Emily Petroff for helpful comments on the manuscript, as well as our anonymous referee for valuable feedback. ...
doi:10.3847/1538-3881/aae649
fatcat:pcq3h6e6jvg25gd7bjur3sqmkm
Jet: Fast quantum circuit simulations with parallel task-based tensor-network contraction
2022
Quantum
We demonstrate the advantages of our method by benchmarking our code on several Sycamore-53 and Gaussian boson sampling (GBS) supremacy circuits against other simulators. ...
iii) the concurrent contraction of tensor networks on all available hardware. ...
The authors thank SOSCIP for their computational resources and financial support. We acknowledge the computational resources and support from SciNet. ...
doi:10.22331/q-2022-05-09-709
fatcat:4upmghf7wrg4vajckhk3kajjym
Fast Local Tone Mapping, Summed-Area Tables and Mesopic Vision Simulation
[chapter]
2012
Computer Graphics
to photographs and films. ...
Display devices, on the other hand, are much more restrictive, since there is no way to dynamically improve or alter their inherently fixed dynamic range capabilities. ...
As the expected clash between GPU and multi-core CPU architectures comes to a close, such memory access constraints tend to disappear. ...
doi:10.5772/37288
fatcat:eb73wm3khnarbihmm5jjdt5c3m
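Since the chapter's title names summed-area tables (SATs), a minimal sketch of the structure may be useful. This is the textbook construction, not the chapter's GPU implementation: after one prefix-sum pass, any axis-aligned box sum costs four lookups, which is why SATs suit the local averaging that local tone-mapping operators need.

```python
def build_sat(img):
    """Build a summed-area table with a one-row/one-column zero border.

    sat[y][x] holds the sum of img over the rectangle [0, y) x [0, x).
    """
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sat[y + 1][x + 1] = (img[y][x] + sat[y][x + 1]
                                 + sat[y + 1][x] - sat[y][x])
    return sat

def box_sum(sat, y0, x0, y1, x1):
    """Sum of img over the inclusive box (y0, x0)..(y1, x1): 4 lookups."""
    return (sat[y1 + 1][x1 + 1] - sat[y0][x1 + 1]
            - sat[y1 + 1][x0] + sat[y0][x0])
```

Dividing `box_sum` by the box area gives the local mean used by such operators; on a GPU the build step is typically two parallel scan passes (rows, then columns).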
Corrfunc — A Suite of Blazing Fast Correlation Functions on the CPU
[article]
2019
arXiv
pre-print
The improved performance of Corrfunc arises from both efficient algorithms as well as software design that suits the underlying hardware of modern CPUs. ...
Corrfunc is designed to be both user-friendly and fast and is publicly available at https://github.com/manodeep/Corrfunc. ...
Mao and A. Hearin for constructive discussion about Corrfunc over the years. MS would particularly like to thank J. ...
arXiv:1911.03545v1
fatcat:a3ydapxxc5culkhpe4ndwedihe
Showing results 1 — 15 out of 2,553 results