Approximate String Processing

Marios Hadjieleftheriou
2009 Foundations and Trends in Databases  
One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This monograph focuses specifically on the problem of approximate string matching, where, given a set of strings S and a query string v, the goal is to find all strings s ∈ S that have a user-specified degree of similarity to v. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set-based and edit-based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages, and disadvantages of each technique for every query type.

Query Types

There are two fundamental query types in string processing: Selections and Joins. There are two fundamental query strategies: All-matches and Top-k matches.

Selection Queries

All-match selection queries return all data strings whose similarity with the query string is larger than or equal to a user-specified threshold.

Definition 4.1 (All-Match Selection Query). Given a string similarity function Θ, a set of strings S, a query string v, and a positive threshold θ, identify the answer set A = {s ∈ S : Θ(v, s) ≥ θ}.

Top-k selection queries return, among all strings in the data, the k strings with the largest similarity to the query.

Definition 4.2 (Top-k Selection Query). Given a string similarity function Θ, a set of strings S, a query string v, and a positive integer k, identify the answer set A, s.t. |A| = k and ∀s ∈ A, s′ ∈ S \ A : Θ(v, s) ≥ Θ(v, s′).

Top-k queries are very useful in practice, since in many applications it is difficult to decide in advance on a meaningful threshold θ for running an all-match query.
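The two selection queries can be sketched as brute-force scans. This is only an illustration under stated assumptions: Θ is taken to be Jaccard similarity over character 2-grams (one possible set-based similarity function), and the names qgrams, jaccard, all_match_selection, and top_k_selection are hypothetical. The indexing and filtering techniques surveyed in the monograph exist precisely to avoid this kind of full scan.

```python
import heapq


def qgrams(s, q=2):
    """Return the set of character q-grams of s (hypothetical helper)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b, q=2):
    """Assumed similarity function Theta: Jaccard over q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two strings with no q-grams are treated as identical
    return len(ga & gb) / len(ga | gb)


def all_match_selection(S, v, theta, sim=jaccard):
    """All-match selection: every s in S with sim(v, s) >= theta."""
    return [s for s in S if sim(v, s) >= theta]


def top_k_selection(S, v, k, sim=jaccard):
    """Top-k selection: the k strings of S most similar to v, best first.

    Ties are broken arbitrarily; fewer than k strings are returned
    if S itself holds fewer than k strings.
    """
    return heapq.nlargest(k, S, key=lambda s: sim(v, s))
```

For example, with S = ["main street", "main st", "maine street", "park avenue"] and v = "main street", the threshold θ = 0.7 and the choice k = 2 select the same two strings, which shows how a top-k query sidesteps having to guess θ in advance.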
Clearly, all-match queries are easier to evaluate than top-k queries, given that a cutoff similarity threshold for top-k queries cannot be decided in advance, making initial pruning of strings difficult. Nevertheless, once k good answers have been identified (good in the sense that the k-th answer has similarity sufficiently close to that of the correct k-th answer), top-k queries essentially degenerate to all-match queries. Query answering strategies typically try to identify k good answers as fast as possible and subsequently revert to all-match query strategies.

Join Queries

Given two sets of strings and a user-specified threshold, all-match join queries return all pairs of strings in the cross product of the two sets with similarity larger than or equal to the threshold.

Definition 4.3 (All-Match Join Query). Given a string similarity function Θ, two sets of strings S, R, and a positive threshold θ, identify the answer set A = {(s, r) ∈ S × R : Θ(s, r) ≥ θ}.

Top-k join queries return the k pairs with the largest similarity among all pairs in the cross product.

Definition 4.4 (Top-k Join Query). Given a string similarity function Θ, two sets of strings S, R, and a positive integer k, identify the answer set A s.t. |A| = k and ∀(s, r) ∈ A and ∀(s′, r′) ∈ (S × R) \ A : Θ(s, r) ≥ Θ(s′, r′).
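A minimal sketch of the two join queries, again by brute force over the cross product. The assumptions here are that Θ is Jaccard similarity over whitespace-separated tokens, and that the names token_jaccard, all_match_join, and top_k_join are hypothetical. In the top-k version, the smallest score kept in the heap plays the role of a running threshold once k pairs have been collected. This sketch still scores every pair, whereas an index-based method could presumably use that threshold to prune candidates.

```python
import heapq
from itertools import product


def token_jaccard(a, b):
    """Assumed similarity function Theta: Jaccard over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # two empty token sets are treated as identical
    return len(ta & tb) / len(ta | tb)


def all_match_join(S, R, theta, sim=token_jaccard):
    """All-match join: every (s, r) in S x R with sim(s, r) >= theta."""
    return [(s, r) for s, r in product(S, R) if sim(s, r) >= theta]


def top_k_join(S, R, k, sim=token_jaccard):
    """Top-k join: the k most similar pairs of S x R, best first.

    Ties are broken arbitrarily.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (score, index, pair); heap[0] is the k-th best so far
    for i, (s, r) in enumerate(product(S, R)):
        score = sim(s, r)
        if len(heap) < k:
            heapq.heappush(heap, (score, i, (s, r)))
        elif score > heap[0][0]:
            # heap[0][0] acts as a running threshold: only pairs that beat
            # the current k-th best score can enter the answer set.
            heapq.heapreplace(heap, (score, i, (s, r)))
    return [pair for _, _, pair in sorted(heap, reverse=True)]
```

With S = ["new york city", "los angeles"] and R = ["york city", "city of los angeles", "boston"], both all_match_join(S, R, 0.5) and top_k_join(S, R, 2) return the pairs ("new york city", "york city") and ("los angeles", "city of los angeles").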
doi:10.1561/1900000010 fatcat:uyacijo4dvgzzjwxvt4jgv5w3i