Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Solon P. Pissis, Ahmad Retha, Marc Herbstritt
2018 Symposium on Experimental and Efficient Algorithms  
An elastic-degenerate string is a sequence of n sets of strings of total length N. It was introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm^2 + N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N · ⌈m/w⌉) with pre-processing time and space O(m · ⌈m/w⌉), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N · ⌈M/w⌉) with pre-processing time and space O(M · ⌈M/w⌉), which is prohibitive in practice. We present a new on-line O(N · ⌈M/w⌉)-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome, showing that a significant number of previously discovered MAWs are in fact false positives when a population's variants are considered.
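To make the problem statement concrete, the following is a minimal sketch of dictionary matching in an elastic-degenerate text. It is a naive illustration only and is not the bit-vector algorithm of the paper, nor does it achieve its time bounds. The function name `ed_dictionary_match`, the representation of the text as a list of Python sets (with the empty string standing for a deletion), and the toy input are all assumptions made for this example.

```python
def ed_dictionary_match(ed_text, patterns):
    """Naive on-line dictionary matching in an elastic-degenerate text.

    ed_text  : list of sets of strings (one set per segment); "" models a gap.
    patterns : iterable of non-empty strings.
    Returns a set of (segment_index, pattern) pairs, where segment_index is
    the segment in which an occurrence of the pattern ends.
    """
    patterns = list(patterns)
    hits = set()
    active = set()  # (pattern, matched_length) pairs carried between segments

    for i, segment in enumerate(ed_text):
        next_active = set()
        for s in segment:
            # Partial matches carried over continue at offset 0 of s;
            # fresh matches may start at any offset inside s.
            candidates = [(p, k, 0) for (p, k) in active]
            candidates += [(p, 0, j) for p in patterns for j in range(len(s))]
            for p, k, j in candidates:
                rest, avail = p[k:], s[j:]
                if avail.startswith(rest):
                    # The remainder of the pattern fits inside this string.
                    hits.add((i, p))
                elif rest.startswith(avail):
                    # The string is exhausted first: carry the partial match.
                    next_active.add((p, k + len(avail)))
        active = next_active
    return hits


# Toy text with n = 3 segments; the middle segment offers "C", "CG" or a gap.
text = [{"A"}, {"C", "CG", ""}, {"T"}]
found = ed_dictionary_match(text, ["ACT", "AT", "CGT", "GG"])
print(sorted(found))
```

Processing one segment at a time and keeping only the carried partial matches mirrors the on-line setting described above; the paper's contribution lies in encoding such state compactly in machine words, which this sketch does not attempt.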
doi:10.4230/lipics.sea.2018.16 dblp:conf/wea/PissisR18 fatcat:won7okgllredpgvtgmjj6b4t4e