Compiling regular patterns to sequential machines

Burak Emir
2005 Proceedings of the 2005 ACM symposium on Applied computing - SAC '05  
Pattern matching combined with regular expressions has many applications including semistructured data matching and lexical analysis in compilers. Variables in patterns allow one to refer to parts of the matching input. But some regular patterns suffer from inherent ambiguity, yielding more than one valid result. A match policy like shortest or longest match can disambiguate such patterns. In this paper, we show that regular pattern matching corresponds to sequential transduction. We derive straightforward ways to optimally compile regular patterns to sequential machines and to decide when regular patterns are unambiguous. Unambiguous patterns can be matched in a single traversal of the input. Ambiguities in patterns correspond to nondeterminism in sequential machines. Applying the match policy optimally yields two deterministic sequential machines, which produce the shortest match in two consecutive runs.
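
To make the notion of ambiguity concrete, here is a minimal Python sketch. It is not the paper's sequential-machine construction: it simply enumerates, by brute force, every way a pattern with variables can bind parts of the input. The function `all_matches`, the list-of-pairs pattern encoding, and the reading of "shortest" and "longest" as a left-to-right preference on variable lengths are all assumptions made for illustration.

```python
import re

def all_matches(parts, text):
    """Enumerate every variable binding for a concatenation of sub-patterns.

    parts: list of (variable_name, regex_source) pairs (hypothetical encoding).
    Returns one dict per way of splitting `text` so that each piece
    fully matches its sub-pattern.
    """
    if not parts:
        # The empty pattern matches only the empty remainder.
        return [{}] if text == "" else []
    (name, rx), rest = parts[0], parts[1:]
    results = []
    for i in range(len(text) + 1):
        if re.fullmatch(rx, text[:i]):
            for tail in all_matches(rest, text[i:]):
                results.append({name: text[:i], **tail})
    return results

# Ambiguous pattern: (x: a*)(y: a*) can split "aa" in three ways.
pattern = [("x", "a*"), ("y", "a*")]
matches = all_matches(pattern, "aa")

# Assumed policies: prefer shorter (or longer) bindings, leftmost variable first.
def lengths(m):
    return [len(m[name]) for name, _ in pattern]

shortest = min(matches, key=lengths)
longest = max(matches, key=lengths)

# Unambiguous pattern: (x: a*)(y: b*) has exactly one binding for "aab".
unique = all_matches([("x", "a*"), ("y", "b*")], "aab")

print(matches)
print(shortest, longest)
print(unique)
```

The enumeration takes time exponential in the number of variables in the worst case, which is the cost the paper's approach avoids: according to the abstract, unambiguous patterns need a single traversal and ambiguous ones need two deterministic runs.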
doi:10.1145/1066677.1066992 dblp:conf/sac/Emir05 fatcat:btj52ok3bnef7ch6btaywqizau