Top-k string similarity search with edit-distance constraints

Dong Deng, Guoliang Li, Jianhua Feng, Wen-Syan Li
2013 2013 IEEE 29th International Conference on Data Engineering (ICDE)  
In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest  ...  edit distances to the query string.  ...  Next we formulate the problem of top-k string similarity search with edit-distance constraints.  ... 
doi:10.1109/icde.2013.6544886 dblp:conf/icde/DengLFL13 fatcat:m7o42h57qvgnffpcdljy5vh5j4

Bed-tree

Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava
2010 Proceedings of the 2010 international conference on Management of data - SIGMOD '10  
Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures.  ...  In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance.  ...  A top-k selection query q = "Michael Stone" with k = 2 returns strings s3 and s4, which are more similar to q than any other string in D.  ... 
doi:10.1145/1807167.1807266 dblp:conf/sigmod/ZhangHOS10 fatcat:qahi6wxc55dl7jcd5b6bhrmucu

String similarity search and join: a survey

Minghe Yu, Guoliang Li, Dong Deng, Jianhua Feng
2015 Frontiers of Computer Science  
In this paper, we present a comprehensive survey on string similarity search and join.  ...  We then present an extensive set of algorithms for string similarity search and join.  ...  Another study on top-k similarity search with edit-distance constraints is proposed by Deng et al. [48].  ... 
doi:10.1007/s11704-015-5900-5 fatcat:n6j4xqojhjgulmzblhbygjuqs4

Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Yu Jiang, Dong Deng, Jiannan Wang, Guoliang Li, Jianhua Feng
2013 Proceedings of the Joint EDBT/ICDT 2013 Workshops on - EDBT '13  
To this end, in this paper we propose parallel algorithms to support efficient similarity search and join with edit-distance constraints.  ...  Although many similarity search and join algorithms have been proposed, they did not utilize the abilities of modern hardware with multi-core processors.  ...  [7] proposed progressive algorithms to find top-k similar strings. CONCLUSION In this paper we study the problem of similarity search and joins with edit distance constraints.  ... 
doi:10.1145/2457317.2457382 dblp:conf/edbt/JiangDWLF13 fatcat:wy7unjrsdnbbhftqrqvkmptq2m

A New Two-Stage Search Procedure for Misuse Detection

Slobodan Petrovic, Katrin Franke
2007 Future Generation Communication and Networking (FGCN 2007)  
the unconstrained edit distance.  ...  A new two-stage indexless search procedure is presented that makes use of the constrained edit distance in IDS misuse detection attack database search.  ...  and the search string. begin comment D consists of k records of length N comment Main loop; D_i is the i-th record of D, i = 1, 2, ..., k for i ← 1 until k do begin comment Compute the constrained edit  ... 
doi:10.1109/fgcn.2007.25 dblp:conf/fgcn/PetrovicF07 fatcat:qd5ujcccqben5fxblfrutzttja

Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search

Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, Jianhua Feng
2015 2015 IEEE 31st International Conference on Data Engineering  
String similarity search is a fundamental operation in data cleaning and integration. It has two variants, threshold-based string similarity search and top-k string similarity search.  ...  For top-k search, we identify promising strings with large possibility to be similar to the query, utilize these strings to estimate an upper bound which is used to prune dissimilar strings, and propose  ...  The second is top-k string similarity search, which finds k strings from the string set that have the smallest edit distances to the query.  ... 
doi:10.1109/icde.2015.7113311 dblp:conf/icde/WangLDZF15 fatcat:7vjtbvgn65a3vdxclyir5m64uq

Efficient and effective KNN sequence search with approximate n-grams

Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang
2013 Proceedings of the VLDB Endowment  
In this paper, we address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance.  ...  Unlike most existing works using short and exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, our new approach allows us to use longer but approximate n-gram matchings  ...  TopkSearch [5] is the most recent method that is proposed to support top-k sequences similarity search with edit-distance constraints. We obtain the executable binary file from the authors.  ... 
doi:10.14778/2732219.2732220 fatcat:33wyo74avrd5bfxjg32lcgkp6i

Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data [chapter]

Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, Ulf Leser
2010 Lecture Notes in Computer Science  
Our tool supports Hamming and edit distance as similarity measure and is available as C++ library, as Unix command line tool, and as cartridge for a commercial database.  ...  Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences.  ...  Searching with Hamming distance constraints has always better response times than edit distance, in the range of 5% (k = 0) to 65% with growing k.  ... 
doi:10.1007/978-3-642-13818-8_36 fatcat:2txsh72iujd3pe4dkia7nkwnay

A Tree-based Indexing Approach for Diverse Textual Similarity Search

Minghe Yu, Chengliang Chai, Ge Yu
2020 IEEE Access  
Based on the index tree, we present a top-k search algorithm with efficient pruning techniques.  ...  INDEX TERMS: Tree-based indexing, top-k similarity search, textual similarity.  ...  In [3], [13]-[19], the researchers studied string similarity search based on edit distance.  ... 
doi:10.1109/access.2020.3022057 fatcat:digpvkzugngb3j3xune2bzj75u

State-of-the-art in string similarity search and join

Sebastian Wandelt, Jiaying Wang, Ulf Leser, Dong Deng, Stefan Gerdjikov, Shashwat Mishra, Petar Mitankin, Manish Patil, Enrico Siragusa, Alexander Tiskin, Wei Wang
2014 SIGMOD record  
String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, or bioinformatics.  ...  Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two  ...  In Figure 1, we show existing evaluation results for the most relevant work on string similarity search/join with edit distance constraints.  ... 
doi:10.1145/2627692.2627706 fatcat:m3cwddf22za6pcnolc6cbc34ui

Mining related queries from search engine query logs

Xiaodong Shi, Christopher C. Yang
2006 Proceedings of the 15th international conference on World Wide Web - WWW '06  
Users can use the suggested related queries to tune or redirect the search process.  ...  is a special type of edit distance, to measure the degree of matching between query strings.  ...  The distance between two query strings then is defined to be the sum of the costs in the cheapest chain of edit operations transforming one query string into the other.  ... 
doi:10.1145/1135777.1135956 dblp:conf/www/ShiY06 fatcat:vmqg6xsklvbotjx5iwz6rtrbdy

A comparison of melodic database retrieval techniques using sung queries

Ning Hu, Roger B. Dannenberg
2002 Proceedings of the second ACM/IEEE-CS joint conference on Digital libraries - JCDL '02  
Thus, algorithms for measuring melodic similarity in query-by-humming systems should be robust. We compare several variations of search algorithms in an effort to improve search precision.  ...  Query-by-humming systems search a database of music for good matches to a sung, hummed, or whistled melody.  ...  Edit Distance When melodies are viewed as strings, one measure of similarity is the number or cost of editing operations that must be performed to make the strings identical.  ... 
doi:10.1145/544220.544292 dblp:conf/jcdl/HuD02 fatcat:htyeewlcirgy5duipljv2rmp4m

Extending dictionary-based entity extraction to tolerate errors

Guoliang Li, Dong Deng, Jianhua Feng
2010 Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10  
In this paper, we study the problem of approximate entity extraction with edit-distance constraints.  ...  A straightforward method first extracts all substrings from the text and then for each substring identifies its similar entities from the dictionary using existing methods for approximate string search  ...  In this paper two strings are similar if their edit distance is no larger than a given edit-distance threshold τ.  ... 
doi:10.1145/1871437.1871616 dblp:conf/cikm/LiDF10 fatcat:sz332c5qwrf6pajn775en5czzu

PepTiger: Search Engine for Error-Tolerant Protein Identification from de Novo Sequences

Irina Fedulova, Zheng Ouyang, Charles Buck, Xiang Zhang
2007 The Open Spectroscopy Journal  
The algorithm is based on approximate string matching followed by a novel scoring procedure which takes into account mass differences and the string distance between de novo sequence and matched peptides  ...  We present a search engine named PepTiger which is capable of correctly matching de novo sequence tags with errors to protein sequences in a protein database.  ...  A 1 = 'abc' 'ad -' A 2 = 'abc' 'a -d' (1) PepTiger Distance Due to the possibility of de novo errors a common edit distance does not fit well for protein similarity search.  ... 
doi:10.2174/187438380701011083 fatcat:uvc7nxgimrf2jb46mdlfk3ix6e