EDIT-DISTANCE OF WEIGHTED AUTOMATA: GENERAL DEFINITIONS AND ALGORITHMS

MEHRYAR MOHRI
International Journal of Foundations of Computer Science, 2003
The problem of computing the similarity between two sequences arises in many areas such as computational biology and natural language processing. A common measure of the similarity of two strings is their edit-distance, that is, the minimal cost of a series of symbol insertions, deletions, or substitutions transforming one string into the other. In several applications such as speech recognition or computational biology, the objects to compare are distributions over strings, i.e., sets of strings representing a range of alternative hypotheses with their associated weights or probabilities. We define the edit-distance of two distributions over strings and present algorithms for computing it when these distributions are given by automata. In the particular case where two sets of strings are given by unweighted automata, their edit-distance can be computed using the general algorithm of composition of weighted transducers combined with a single-source shortest-paths algorithm. In the general case, we show that general weighted automata algorithms over the appropriate semirings can be used to compute the edit-distance of two weighted automata exactly. These include classical algorithms such as the composition and ε-removal of weighted transducers and a new and simple synchronization algorithm for weighted transducers which, combined with ε-removal, can be used to normalize weighted transducers with bounded delays. Our algorithm for computing the edit-distance of weighted automata can be used to improve the word accuracy of automatic speech recognition systems. It can also be extended to provide an edit-distance automaton useful for re-scoring and other post-processing purposes in the context of large-vocabulary speech recognition.
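As a point of reference for the string-level definition used above, the edit-distance of two strings with unit insertion, deletion, and substitution costs can be computed by the standard dynamic-programming recurrence. The following Python sketch is not taken from the paper; it is included only to illustrate that definition.

    def edit_distance(x, y):
        """Classical string edit-distance with unit insertion, deletion,
        and substitution costs, computed by dynamic programming."""
        m, n = len(x), len(y)
        # d[i][j] = edit-distance between the prefixes x[:i] and y[:j]
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i            # delete the i symbols of x[:i]
        for j in range(n + 1):
            d[0][j] = j            # insert the j symbols of y[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if x[i - 1] == y[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution or match
        return d[m][n]

    assert edit_distance("kitten", "sitting") == 3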
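For the unweighted case mentioned in the abstract, the construction can be pictured as composing the two automata with a one-state edit transducer over the tropical (min, +) semiring and then running a single-source shortest-paths algorithm. The Python sketch below is a simplified illustration of that reading, not the paper's algorithm: it assumes ε-free automata encoded as plain dictionaries, uses unit edit costs, and relies on a Dijkstra-style search; the representation and the helper name automaton_edit_distance are choices made here for illustration only.

    import heapq

    def automaton_edit_distance(A1, A2):
        """Sketch (not the paper's algorithm): edit-distance between the
        languages of two epsilon-free unweighted automata, computed as a
        shortest path in their product with a one-state edit transducer
        over the tropical (min, +) semiring.

        Each automaton is a dict with keys:
          'start'  : start state
          'finals' : set of final states
          'trans'  : dict state -> list of (label, next_state)
        """
        start = (A1['start'], A2['start'])
        dist = {start: 0.0}
        heap = [(0.0, start)]
        while heap:
            d, (q1, q2) = heapq.heappop(heap)
            if d > dist.get((q1, q2), float('inf')):
                continue                          # stale heap entry
            if q1 in A1['finals'] and q2 in A2['finals']:
                return d                          # first accepting pair settled is optimal
            moves = []
            # match (cost 0) or substitution (cost 1): advance in both automata
            for a, n1 in A1['trans'].get(q1, []):
                for b, n2 in A2['trans'].get(q2, []):
                    moves.append((0.0 if a == b else 1.0, (n1, n2)))
            # deletion (cost 1): advance only in A1
            for a, n1 in A1['trans'].get(q1, []):
                moves.append((1.0, (n1, q2)))
            # insertion (cost 1): advance only in A2
            for b, n2 in A2['trans'].get(q2, []):
                moves.append((1.0, (q1, n2)))
            for c, nxt in moves:
                nd = d + c
                if nd < dist.get(nxt, float('inf')):
                    dist[nxt] = nd
                    heapq.heappush(heap, (nd, nxt))
        return float('inf')                       # one of the languages is empty

    # Example: A1 accepts "ab"; A2 accepts "abc" and "ad"; the distance is 1.
    A1 = {'start': 0, 'finals': {2},
          'trans': {0: [('a', 1)], 1: [('b', 2)]}}
    A2 = {'start': 0, 'finals': {3, 4},
          'trans': {0: [('a', 1)], 1: [('b', 2), ('d', 4)], 2: [('c', 3)]}}
    assert automaton_edit_distance(A1, A2) == 1

The general weighted case treated in the paper additionally requires handling the weights of the two distributions, ε-removal, and synchronization of transducers with bounded delays, none of which appears in this simplified sketch.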
doi:10.1142/s0129054103002114