SEPIA: estimating selectivities of approximate string predicates in large Databases

Liang Jin, Chen Li, Rares Vernica
2008 The VLDB journal  
Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as "name similar to smith" and "telephone number similar to 412-0964." Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called Sepia, to solve the
more » ... roblem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.
doi:10.1007/s00778-007-0061-2 fatcat:axxg7w7tkjdbhjkus4vzeu5acu