Indexing a Dictionary for Subset Matching Queries [chapter]

Gad M. Landau, Dekel Tsur, Oren Weimann
String Processing and Information Retrieval  
We consider a subset matching variant of the Dictionary Query paradigm. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p appears in D. The pattern p is said to appear in D if there exists some s ∈ D where |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|. Furthermore, for every pattern p that appears in D we would like to know the number of times p appears in D. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n lg(min{n, m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer, and the second uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n lg(min{n, m})) preprocessing time and O(|p| lg lg |Σ| + min{|p|, lg(|Σ|^k n)} lg lg(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries.

Our problem is motivated by haplotype inference from a library of genotypes [14, 17]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, as well as gathering statistical information about them, can be used to accelerate various haplotype inference algorithms. These include algorithms based on the "pure parsimony criteria" [13, 16], greedy heuristics such as "Clark's rule" [6, 18], EM-based algorithms [1, 11, 12, 20, 26, 30], and algorithms for inferring haplotypes from a set of trios [4, 27].
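The matching rule and the idea of indexing every pattern generated by D can be illustrated with a minimal Python sketch. It assumes each dictionary string is given as a list of character sets, and it builds a plain, uncompressed trie by enumerating all patterns of each string. This is not the divide-and-conquer or fingerprinting construction described in the abstract, and it does not achieve the stated preprocessing bounds. All function names (appears, count_naive, build_trie, count_trie) are hypothetical.

```python
from itertools import product

def appears(p, s):
    """True if p matches s: equal length and p[i] in s[i] at every position."""
    return len(p) == len(s) and all(c in chars for c, chars in zip(p, s))

def count_naive(p, D):
    """Number of strings in D that p matches, by scanning the whole dictionary."""
    return sum(appears(p, s) for s in D)

def build_trie(D):
    """Uncompressed trie over every pattern generated by D (illustrative only).

    Each node is a dict mapping a character to a child node. The key None
    holds the number of dictionary strings that generate the pattern ending
    at that node.
    """
    root = {}
    for s in D:
        # A string with k multi-character positions expands to at most
        # |alphabet|^k distinct patterns.
        for pattern in product(*s):
            node = root
            for c in pattern:
                node = node.setdefault(c, {})
            node[None] = node.get(None, 0) + 1
    return root

def count_trie(p, root):
    """Walk the trie along p in O(|p|) dict lookups and return its count."""
    node = root
    for c in p:
        node = node.get(c)
        if node is None:
            return 0
    return node.get(None, 0)

# Toy genotype library over the alphabet {"0", "1"}.
D = [
    [{"0"}, {"0", "1"}, {"1"}],   # generates 001 and 011
    [{"0", "1"}, {"0"}, {"1"}],   # generates 001 and 101
]
root = build_trie(D)
for p in ("001", "011", "101", "111", "00"):
    print(p, count_naive(p, D), count_trie(p, root))
```

Because every position holds a set, each dictionary string generates a given pattern at most once, so the count stored in the trie equals the number of strings that the pattern matches.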
doi:10.1007/978-3-540-75530-2_18 dblp:conf/spire/LandauTW07 fatcat:hhx3bbdypfajdoefhbsvw4x63i