Compressed text indexing with wildcards

Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, Jeffrey Scott Vitter
2013 Journal of Discrete Algorithms  
Let $T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}$ be a text of total length $n$, where the characters of each $T_i$ are chosen from an alphabet $\Sigma$ of size $\sigma$, and $\phi$ denotes a wildcard symbol. The text indexing with wildcards problem is to index $T$ so that, given a query pattern $P$, we can locate the occurrences of $P$ in $T$ efficiently. This problem has been applied to indexing genomic sequences that contain single-nucleotide polymorphisms (SNPs), because SNPs can be modeled as wildcards. Recently, Tam et al. (2009) and Thachuk (2011) proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only $nH_h + o(n \log \sigma) + O(d \log n)$ bits of space, where $H_h$ is the $h$th-order empirical entropy of $T$, for $h = o(\log_\sigma n)$.
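To make the matching semantics of the problem concrete, the following is a minimal brute-force sketch in Python. It is not the compressed index described in the paper; it only illustrates what counts as an occurrence when a wildcard in the text matches any pattern character. The function name `find_occurrences` and the use of `?` as a stand-in for the wildcard symbol are assumptions made for this illustration.

```python
WILDCARD = "?"  # hypothetical stand-in for the wildcard symbol in the text


def find_occurrences(text, pattern, wildcard=WILDCARD):
    """Return every start position where `pattern` occurs in `text`.

    A wildcard character in `text` matches any pattern character.
    This is a naive O(n * m) scan, shown only to pin down the problem
    definition; it does not build or use any index.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        # Position i is an occurrence if each aligned text character is
        # either a wildcard or equal to the pattern character.
        if all(text[i + j] == wildcard or text[i + j] == pattern[j]
               for j in range(m)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    text = "AC?GT??A"  # three text segments separated by wildcard groups
    print(find_occurrences(text, "CAG"))
    print(find_occurrences(text, "TA"))
```

Note that an occurrence may span one or more wildcard groups, which is what makes the indexed version of the problem harder than ordinary pattern matching.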
doi:10.1016/j.jda.2012.12.003 fatcat:snw2ro2ofzf2tdhgovwfmdac6y