Boosting the Quality of Approximate String Matching by Synonyms

Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Xiaokui Xiao
2015 ACM Transactions on Database Systems  
A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically-different strings can represent the same real-world object. For example, "Bill" is a short form of "William," and "Database Management Systems" can be abbreviated as "DBMS." Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator to estimate the size of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.
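The idea of an expansion-based similarity can be illustrated with a small sketch. It is not the article's algorithm: it assumes whitespace tokenization, bidirectional synonym rules, and Jaccard similarity over the expanded token sets, and every name in it (`RULES`, `expand`, `synonym_similarity`) is hypothetical.

```python
def tokenize(s):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return s.lower().split()


def contains_phrase(toks, phrase):
    """True if the tuple `phrase` occurs as a contiguous run in the list `toks`."""
    n = len(phrase)
    return any(tuple(toks[i:i + n]) == phrase for i in range(len(toks) - n + 1))


def expand(toks, rules):
    """Add the tokens of the other side of every synonym rule that matches."""
    expanded = set(toks)
    for lhs, rhs in rules:
        # Rules are treated as bidirectional here (an assumption).
        if contains_phrase(toks, lhs):
            expanded.update(rhs)
        if contains_phrase(toks, rhs):
            expanded.update(lhs)
    return expanded


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def synonym_similarity(s1, s2, rules):
    """Jaccard similarity computed on the synonym-expanded token sets."""
    return jaccard(expand(tokenize(s1), rules), expand(tokenize(s2), rules))


# Hypothetical rule set built from the two examples in the abstract.
RULES = [
    (("bill",), ("william",)),
    (("database", "management", "systems"), ("dbms",)),
]

plain = jaccard(set(tokenize("Bill Gates")), set(tokenize("William Gates")))
boosted = synonym_similarity("Bill Gates", "William Gates", RULES)
print(plain, boosted)
```

Under these assumptions, "Bill Gates" and "William Gates" share only one of three tokens before expansion, but their expanded token sets coincide. Pairs with no applicable rule and no shared token, such as "Sam" and "Samuel", still score zero, because this sketch compares whole tokens and does not use q-grams.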
doi:10.1145/2818177 fatcat:mpbtvnuifvaktg6n3tpuxmdaqe