Compact Recognizers of Episode Sequences

Alberto Apostolico, Mikhail J. Atallah
2002 Information and Computation  
Given two strings $T = a_1 \ldots a_n$ and $P = b_1 \ldots b_m$ over an alphabet $\Sigma$, the problem of testing whether $P$ occurs as a subsequence of $T$ is trivially solved in linear time. It is also known that a simple $O(n \log |\Sigma|)$ time preprocessing of $T$ makes it easy to decide subsequently, for any $P$ and in at most $|P| \log |\Sigma|$ character comparisons, whether $P$ is a subsequence of $T$. These problems become more complicated if one asks instead whether $P$ occurs as a subsequence of some substring $Y$ of $T$ of ... ed length. This paper presents an automaton built on the text string $T$ and capable of identifying all distinct minimal substrings $Y$ of $T$ having $P$ as a subsequence. By a substring $Y$ being minimal with respect to $P$, it is meant that $P$ is not a subsequence of any proper substring of $Y$. For every minimal substring $Y$, the automaton recognizes the occurrence of $P$ having the lexicographically smallest sequence of symbol positions in $Y$. It is not difficult to realize such an automaton in time and space $O(n^2)$ for a text of $n$ characters. One result of this paper consists of bringing those bounds down to linear or $O(n \log n)$, respectively, depending on whether the alphabet is bounded or of arbitrary size, thereby matching the respective complexities of off-line exact string searching. Having built the automaton, the search for all lexicographically earliest occurrences of $P$ in $T$ is carried out in time $O\left(n + \sum_{i=1}^{m} rocc_i \cdot i \cdot \log n \log |\Sigma|\right)$, where $rocc_i$ is the number of distinct minimal substrings of $T$ having $b_1 \ldots b_i$ as a subsequence. All log factors appearing in the above bounds can be further reduced to $\log \log$ by resorting to known integer-handling data structures.
Index Terms: Algorithms, pattern matching, subsequence and episode searching, DAWG, suffix automaton, compact subsequence automaton, skip-edge DAWG, forward failure function, skip-link.
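The problems named in the abstract can be illustrated with a naive baseline. The sketch below is not the automaton of the paper and does not achieve its bounds; it is a simple reference, assuming per-symbol sorted position lists searched by binary search (so each character costs $O(\log n)$ rather than the $O(\log |\Sigma|)$ quoted above). The function names `is_subsequence`, `build_positions`, `earliest_end`, and `minimal_windows` are hypothetical, and windows are reported as index pairs rather than as distinct strings.

```python
from bisect import bisect_left
from collections import defaultdict


def is_subsequence(p, t):
    """Linear-time test: scan t once, consuming it while matching p."""
    it = iter(t)
    return all(ch in it for ch in p)


def build_positions(t):
    """Preprocess t: map each symbol to the sorted list of its positions."""
    pos = defaultdict(list)
    for i, ch in enumerate(t):
        pos[ch].append(i)
    return pos


def earliest_end(pos, p, start):
    """Greedily match p inside t[start:]; return the index of the last
    matched symbol, or None if p is not a subsequence of t[start:]."""
    i = start
    last = None
    for ch in p:
        lst = pos.get(ch, [])
        k = bisect_left(lst, i)  # first occurrence of ch at index >= i
        if k == len(lst):
            return None
        last = lst[k]
        i = last + 1
    return last


def minimal_windows(t, p):
    """Return (start, end) index pairs of every minimal substring of t
    that has p as a subsequence (naive baseline, assumed for illustration)."""
    if not p:
        return []
    pos = build_positions(t)
    windows = []
    for s in pos.get(p[0], []):
        e = earliest_end(pos, p, s)
        if e is None:
            break  # later starts cannot match either
        # Earliest ends are non-decreasing in s, so a later start with the
        # same end strictly shrinks the previous window: keep the later one.
        if windows and windows[-1][1] == e:
            windows[-1] = (s, e)
        else:
            windows.append((s, e))
    return windows
```

For example, `minimal_windows("aab", "ab")` keeps only the window starting at the second `a`, because the window starting at the first `a` contains it as a proper substring.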
doi:10.1006/inco.2002.3143 fatcat:kido2ngxsjae3fkk46ghed4scq