Coconut
2018
Proceedings of the VLDB Endowment
As a result, Coconut is able to use bulk-loading techniques that rely on sorting to quickly build a contiguous index using large sequential disk I/Os. ...
We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. ...
The algorithm differs from the original SIMS algorithm in that it searches over the sorted invSAX representations for the initial pruning, and it then uses the Coconut-Tree index to get the raw data-series ...
doi:10.14778/3199517.3199519
fatcat:eliltucsbbge5fpe7iwv7koziy
Coconut: a scalable bottom-up approach for building data series indexes
[article]
2020
arXiv
pre-print
As a result, Coconut is able to use bulk-loading techniques that rely on sorting to quickly build a contiguous index using large sequential disk I/Os. ...
We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. ...
The algorithm differs from the original SIMS algorithm in that it searches over the sorted invSAX representations for the initial pruning, and it then uses the Coconut-Tree index to get the raw data-series ...
arXiv:2006.13713v1
fatcat:stt5dnstqberjcb26idtpxb76m
Coconut: sortable summarizations for scalable indexes over static and streaming data series
[article]
2021
arXiv
pre-print
As a result, Coconut is able to use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os. ...
We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. ...
The partitioning phase involves scanning the raw file in chunks that fit in main memory, sorting each chunk in main memory, and flushing it to secondary storage as a sorted partition. ...
arXiv:2006.11474v2
fatcat:552r3xccczgdtffkegwreesjbm
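The partitioning phase described in the snippet above (scan in memory-sized chunks, sort each chunk, flush it as a sorted partition) can be sketched roughly as follows. This is a minimal illustration and not the Coconut implementation: the function name `write_sorted_partitions`, the one-record-per-line text format, and the use of plain strings as sortable keys are all assumptions made for the example.

```python
import os
from itertools import islice


def write_sorted_partitions(records, chunk_size, out_dir):
    """Split an iterable of sortable string records into sorted partition files.

    Hypothetical helper: chunk_size stands in for "what fits in main memory".
    Each chunk is sorted in memory and flushed to its own file, one record
    per line. Returns the list of partition file paths in creation order.
    """
    it = iter(records)
    paths = []
    while True:
        # Read the next chunk of at most chunk_size records.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # Sort the chunk in main memory.
        chunk.sort()
        # Flush it to secondary storage as one sorted partition.
        path = os.path.join(out_dir, f"partition_{len(paths):04d}.txt")
        with open(path, "w") as f:
            f.writelines(f"{r}\n" for r in chunk)
        paths.append(path)
    return paths
```

A later merge phase would combine the sorted partitions into one sorted run; that step is not shown here.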
Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores
[chapter]
2012
Lecture Notes in Computer Science
Our evaluation reveals that PeARL reaches a significant performance gain compared to single-threaded solutions. ...
We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are entirely held in main memory. ...
Similarity search starts with a given search string q and traverses each trie partition in a PeARL index starting at root. ...
doi:10.1007/978-3-642-29740-3_3
fatcat:kjucuguhezhtflqz6x5woqhbpi
Efficient and effective KNN sequence search with approximate n-grams
2013
Proceedings of the VLDB Endowment
as a basis of KNN candidates pruning. ...
Based on this new idea, we devise a pipeline framework over a two-level index for searching KNN in the sequence database. ...
Acknowledgment Wang was supported by the Singapore NRF under its IRC@SG Funding Initiative and administered by the IDMPO, at the SeSaMe Centre. ...
doi:10.14778/2732219.2732220
fatcat:33wyo74avrd5bfxjg32lcgkp6i
Practical Batch-Updatable External Hashing with Sorting
[chapter]
2013
2013 Proceedings of the Fifteenth Workshop on Algorithm Engineering and Experiments (ALENEX)
Our scheme combines three key techniques: (1) a new index data structure (Entropy-Coded Tries); (2) the use of sorting as the main data manipulation method; and (3) support for incremental index construction ...
This paper presents a practical external hashing scheme that supports fast lookup (7 microseconds) for large datasets (millions to billions of items) with a small memory footprint (2.5 bits/item) and fast ...
We would like to thank Guy Blelloch, Danny Sleator, Michelle Mazurek, and anonymous reviewers of ALENEX13 for their valuable feedback and Fabiano C. Botelho for providing the source code of EPH. ...
doi:10.1137/1.9781611972931.15
dblp:conf/alenex/LimAK13
fatcat:iahlzxoxnzdxhe2xw7kccniowm
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. ...
It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. ...
Gibbons, Vijay Vasudevan, and Amar Phanishayee for their feedback, Guy Blelloch and Rasmus Pagh for pointing out several algorithmic possibilities, and Robert Morris for shepherding this paper. ...
doi:10.1145/2043556.2043558
dblp:conf/sosp/LimFAK11
fatcat:ehrgp54s2nb5bf6yvefnlth4fu
Identification of Similar Strings in a Dataset using Scalable Join
2015
International Journal of Computer Applications
The proposed architecture uses the MapReduce concept and is based on inverted index and multiple prefix filtering methods. ...
Similarity Join plays an important role in data integration and cleansing, record linkage and data de-duplication. It finds similar string pairs from collections of strings. ...
The edit similarity join method based on the Trie-tree was proposed in [7], in which sub-trie pruning techniques are applied. ...
doi:10.5120/21329-4295
fatcat:zqygay4gnnc3xnpmrasprnv3l4
Location-aware instant search
2012
Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12
PR-Tree is a tree-based index structure which seamlessly integrates the textual description and spatial information to index the spatial data. ...
To address this challenge, in this paper we propose a novel index structure, prefix-region tree (called PR-Tree), to efficiently support location-aware instant search. ...
The poor performance of MT was due to its trie-based index structure, which failed to utilize effective spatial pruning.
Index Sizes We then examined the space complexity of different indices. ...
doi:10.1145/2396761.2396812
dblp:conf/cikm/ZhongFLTZ12
fatcat:jab7mil4fzfjtcbgbtjb5le3vq
Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
2019
PeerJ
Conclusions We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures. ...
Results We reformulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. ...
Like the adjacency method, it involves iterating through the sorted UMIs from highest to lowest frequency. ...
doi:10.7717/peerj.8275
pmid:31871845
pmcid:PMC6921982
fatcat:mra73difbndctmk5roisphi3hm
Algorithms for efficiently collapsing reads with Unique Molecular Identifiers
[article]
2019
bioRxiv
pre-print
Conclusions: We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures. ...
Results: We formulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used. ...
Like the adjacency method, it involves iterating through the sorted UMIs from highest to lowest frequency. ...
doi:10.1101/648683
fatcat:clz6nlfjtjbfxmbx5xqq7iraeq
CedrusDB: Persistent Key-Value Store with Memory-Mapped Lazy-Trie
[article]
2021
arXiv
pre-print
A "lazy-trie" is a variant of the hash-trie data structure that achieves near-optimal height, has practical storage overhead, and can be maintained on-disk with standard write-ahead logging. ...
As a result of RAM becoming cheaper, there has been a trend in key-value store design towards maintaining a fast in-memory index (such as a hash table) while logging user operations to disk, allowing high ...
It uses a B+-tree for indexing and stages insertions to LSM. ...
arXiv:2005.13762v3
fatcat:wkbsap2jbjf67eyesbc6zis4du
Efficient and Scalable Processing of String Similarity Join
2013
IEEE Transactions on Knowledge and Data Engineering
The string similarity join is a basic operation of many applications that need to find all string pairs from a collection given a similarity function and a user specified threshold. ...
These algorithms typically adopt a two-step filter-and-refine approach in identifying similar string pairs: (1) generating candidate pairs by traversing the inverted index; and (2) verifying the candidate ...
The token file and sorting rules are distributed as cache files before executing the task. ...
doi:10.1109/tkde.2012.195
fatcat:duymkqx6z5br5alidzx4vfju4i
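The two-step filter-and-refine approach mentioned in the snippet above can be illustrated with a small sketch. It is not the algorithm from the cited paper: whitespace tokenization, Jaccard similarity as the similarity function, and the names `jaccard` and `similarity_join` are assumptions chosen to keep the example short.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets (assumed measure)."""
    return len(a & b) / len(a | b)


def similarity_join(strings, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard >= threshold."""
    token_sets = [set(s.split()) for s in strings]

    # Build the inverted index: token -> ids of strings containing it.
    index = defaultdict(list)
    for i, tokens in enumerate(token_sets):
        for t in tokens:
            index[t].append(i)

    # Step 1 (filter): any two strings sharing a token form a candidate pair.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(ids, 2))

    # Step 2 (refine): verify each candidate with the exact similarity.
    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )
```

For example, `similarity_join(["data series index", "data series indexes", "key value store"], 0.5)` keeps only the pair `(0, 1)`, since those two strings share two of four distinct tokens. Practical systems typically add stronger filters, such as prefix filtering, so that fewer candidate pairs reach the verification step.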
Fuzzy Keyword Search over Encrypted Data using Symbol-Based Trie-traverse Search Scheme in Cloud Computing
[article]
2012
arXiv
pre-print
We further propose a brand new symbol-based trie-traverse searching scheme, where a multi-way tree structure is built up using symbols transformed from the resulted fuzzy keyword sets. ...
We exploit edit distance to quantify keywords similarity and develop two advanced techniques on constructing fuzzy keyword sets, which achieve optimized storage and representation overheads. ...
The cloud server is responsible for mapping the searching request to set of data files, where each file is indexed by a file ID and linked to a set of keywords. ...
arXiv:1211.3682v1
fatcat:zjmz3pne6zdszfymzfkwq5pntm
Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: The P2P-DIET Approach
[chapter]
2004
Lecture Notes in Computer Science
P2P-DIET offers a simple data model for the description of network resources based on attributes with values of type text and a query language based on concepts from Information Retrieval. ...
The focus of this paper is on the main modelling concepts of P2P-DIET (metadata, advertisements and queries), the routing algorithms (inspired by the publish/subscribe system SIENA) and the scalable indexing ...
., files in a file-sharing application) are kept at client nodes, although it is possible in special cases to store resources at super-peer nodes. ...
doi:10.1007/978-3-540-30192-9_49
fatcat:3bkcz75e7ralxal64gdozxfstq