Coconut

Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas
2018 Proceedings of the VLDB Endowment  
As a result, Coconut is able to use bulk-loading techniques that rely on sorting to quickly build a contiguous index using large sequential disk I/Os.  ...  We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order.  ...  The algorithm differs from the original SIMS algorithm in that it searches over the sorted invSAX representations for the initial pruning, and it then uses the Coconut-Tree index to get the raw data-series  ... 
doi:10.14778/3199517.3199519 fatcat:eliltucsbbge5fpe7iwv7koziy

Coconut: a scalable bottom-up approach for building data series indexes [article]

Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas
2020 arXiv   pre-print
As a result, Coconut is able to use bulk-loading techniques that rely on sorting to quickly build a contiguous index using large sequential disk I/Os.  ...  We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order.  ...  The algorithm differs from the original SIMS algorithm in that it searches over the sorted invSAX representations for the initial pruning, and it then uses the Coconut-Tree index to get the raw data-series  ... 
arXiv:2006.13713v1 fatcat:stt5dnstqberjcb26idtpxb76m

Coconut: sortable summarizations for scalable indexes over static and streaming data series [article]

Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas
2021 arXiv   pre-print
As a result, Coconut is able to use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os.  ...  We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order.  ...  The partitioning phase involves scanning the raw file in chunks that fit in main memory, sorting each chunk in main memory, and flushing it to secondary storage as a sorted partition.  ... 
arXiv:2006.11474v2 fatcat:552r3xccczgdtffkegwreesjbm

Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores [chapter]

Astrid Rheinländer, Ulf Leser
2012 Lecture Notes in Computer Science  
Our evaluation reveals that PeARL reaches a significant performance gain compared to single-threaded solutions.  ...  We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are entirely held in main memory.  ...  Similarity search starts with a given search string q and traverses each trie partition in a PeARL index starting at root.  ... 
doi:10.1007/978-3-642-29740-3_3 fatcat:kjucuguhezhtflqz6x5woqhbpi

Efficient and effective KNN sequence search with approximate n-grams

Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang
2013 Proceedings of the VLDB Endowment  
as a basis of KNN candidates pruning.  ...  Based on this new idea, we devise a pipeline framework over a two-level index for searching KNN in the sequence database.  ...  Acknowledgment Wang was supported by the Singapore NRF under its IRC@SG Funding Initiative and administered by the IDMPO, at the SeSaMe Centre.  ... 
doi:10.14778/2732219.2732220 fatcat:33wyo74avrd5bfxjg32lcgkp6i

Practical Batch-Updatable External Hashing with Sorting [chapter]

Hyeontaek Lim, David G. Andersen, Michael Kaminsky
2013 2013 Proceedings of the Fifteenth Workshop on Algorithm Engineering and Experiments (ALENEX)  
Our scheme combines three key techniques: (1) a new index data structure (Entropy-Coded Tries); (2) the use of sorting as the main data manipulation method; and (3) support for incremental index construction  ...  This paper presents a practical external hashing scheme that supports fast lookup (7 microseconds) for large datasets (millions to billions of items) with a small memory footprint (2.5 bits/item) and fast  ...  We would like to thank Guy Blelloch, Danny Sleator, Michelle Mazurek, and anonymous reviewers of ALENEX13 for their valuable feedback and Fabiano C. Botelho for providing the source code of EPH.  ... 
doi:10.1137/1.9781611972931.15 dblp:conf/alenex/LimAK13 fatcat:iahlzxoxnzdxhe2xw7kccniowm

SILT

Hyeontaek Lim, Bin Fan, David G. Andersen, Michael Kaminsky
2011 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles - SOSP '11  
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node.  ...  It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each.  ...  Gibbons, Vijay Vasudevan, and Amar Phanishayee for their feedback, Guy Blelloch and Rasmus Pagh for pointing out several algorithmic possibilities, and Robert Morris for shepherding this paper.  ... 
doi:10.1145/2043556.2043558 dblp:conf/sosp/LimFAK11 fatcat:ehrgp54s2nb5bf6yvefnlth4fu

Identification of Similar Strings in a Dataset using Scalable Join

Khalid F. Alfatmi, Archana S. Vaidya
2015 International Journal of Computer Applications  
The proposed architecture uses the MapReduce concept and is based on inverted index and multiple prefix filtering methods.  ...  Similarity Join plays an important role in data integration and cleansing, record linkage and data de-duplication. It finds similar string pairs from collections of strings.  ...  The edit similarity join method based on the Trie-tree was proposed in [7], in which sub-trie pruning techniques are applied.  ... 
doi:10.5120/21329-4295 fatcat:zqygay4gnnc3xnpmrasprnv3l4

Location-aware instant search

Ruicheng Zhong, Ju Fan, Guoliang Li, Kian-Lee Tan, Lizhu Zhou
2012 Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12  
PR-Tree is a tree-based index structure which seamlessly integrates the textual description and spatial information to index the spatial data.  ...  To address this challenge, in this paper we propose a novel index structure, prefix-region tree (called PR-Tree), to efficiently support location-aware instant search.  ...  The poor performance of MT was due to its trie-based index structure, which failed to utilize effective spatial pruning. Index Sizes We then examined the space complexity of different indices.  ... 
doi:10.1145/2396761.2396812 dblp:conf/cikm/ZhongFLTZ12 fatcat:jab7mil4fzfjtcbgbtjb5le3vq

Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

Daniel Liu
2019 PeerJ  
Conclusions We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures.  ...  Results We reformulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used.  ...  Like the adjacency method, it involves iterating through the sorted UMIs from highest to lowest frequency.  ... 
doi:10.7717/peerj.8275 pmid:31871845 pmcid:PMC6921982 fatcat:mra73difbndctmk5roisphi3hm

Algorithms for efficiently collapsing reads with Unique Molecular Identifiers [article]

Daniel Liu
2019 bioRxiv   pre-print
Conclusions: We present a new formulation of the UMI deduplication problem, and show that it can be solved faster, with more sophisticated data structures.  ...  Results: We formulate the problem of deduplicating UMIs in a manner that enables optimizations to be made, and more efficient data structures to be used.  ...  Like the adjacency method, it involves iterating through the sorted UMIs from highest to lowest frequency.  ... 
doi:10.1101/648683 fatcat:clz6nlfjtjbfxmbx5xqq7iraeq

CedrusDB: Persistent Key-Value Store with Memory-Mapped Lazy-Trie [article]

Maofan Yin, Hongbo Zhang, Robbert van Renesse, Emin Gün Sirer
2021 arXiv   pre-print
A "lazy-trie" is a variant of the hash-trie data structure that achieves near-optimal height, has practical storage overhead, and can be maintained on-disk with standard write-ahead logging.  ...  As a result of RAM becoming cheaper, there has been a trend in key-value store design towards maintaining a fast in-memory index (such as a hash table) while logging user operations to disk, allowing high  ...  It uses a B+-tree for indexing and stages insertions to LSM.  ... 
arXiv:2005.13762v3 fatcat:wkbsap2jbjf67eyesbc6zis4du

Efficient and Scalable Processing of String Similarity Join

Chuitian Rong, Wei Lu, Xiaoli Wang, Xiaoyong Du, Yueguo Chen, Anthony K.H. Tung
2013 IEEE Transactions on Knowledge and Data Engineering  
The string similarity join is a basic operation of many applications that need to find all string pairs from a collection given a similarity function and a user specified threshold.  ...  These algorithms typically adopt a two-step filter-and-refine approach in identifying similar string pairs: (1) generating candidate pairs by traversing the inverted index; and (2) verifying the candidate  ...  The token file and sorting rules are distributed as cache files before executing the task.  ... 
doi:10.1109/tkde.2012.195 fatcat:duymkqx6z5br5alidzx4vfju4i

Fuzzy Keyword Search over Encrypted Data using Symbol-Based Trie-traverse Search Scheme in Cloud Computing [article]

P. Naga Aswani, K. Chandra Shekar
2012 arXiv   pre-print
We further propose a brand new symbol-based trie-traverse searching scheme, where a multi-way tree structure is built up using symbols transformed from the resulted fuzzy keyword sets.  ...  We exploit edit distance to quantify keywords similarity and develop two advanced techniques on constructing fuzzy keyword sets, which achieve optimized storage and representation overheads.  ...  The cloud server is responsible for mapping the searching request to set of data files, where each file is indexed by a file ID and linked to a set of keywords.  ... 
arXiv:1211.3682v1 fatcat:zjmz3pne6zdszfymzfkwq5pntm

Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: The P2P-DIET Approach [chapter]

Stratos Idreos, Christos Tryfonopoulos, Manolis Koubarakis, Yannis Drougas
2004 Lecture Notes in Computer Science  
P2P-DIET offers a simple data model for the description of network resources based on attributes with values of type text and a query language based on concepts from Information Retrieval.  ...  The focus of this paper is on the main modelling concepts of P2P-DIET (metadata, advertisements and queries), the routing algorithms (inspired by the publish/subscribe system SIENA) and the scalable indexing  ...  ., files in a file-sharing application) are kept at client nodes, although it is possible in special cases to store resources at super-peer nodes.  ... 
doi:10.1007/978-3-540-30192-9_49 fatcat:3bkcz75e7ralxal64gdozxfstq