Approximate substring selectivity estimation

Hongrae Lee, Raymond T. Ng, Kyuseok Shim
2009 Proceedings of the 12th International Conference on Extending Database Technology Advances in Database Technology - EDBT '09  
We study the problem of estimating selectivity of approximate substring queries. Its importance in databases is ever increasing as more and more data are input by users and are integrated with many typographical errors and different spelling conventions. To begin with, we consider edit distance for the similarity between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms, MOF and LBS for the task. The latter extends the former with
more » ... eas from set hashing signatures. The experimental results show that MOF is a light-weight algorithm that gives fairly accurate estimations. However, if more space is available, LBS can give better accuracy than MOF and other baseline methods. Next, we extend the proposed solution to other similarity predicates, SQL LIKE operator and Jaccard similarity.
doi:10.1145/1516360.1516455 dblp:conf/edbt/LeeNS09 fatcat:dkleqy5zejhdxfnqfwfl6cevfm