Name Phylogeny: A Generative Model of String Variation

Nicholas Andrews, Jason Eisner, Mark Dredze
2012 Conference on Empirical Methods in Natural Language Processing  
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
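
For orientation, the untrained Levenshtein baseline named in the abstract can be sketched as follows. This is only a minimal illustration of unit-cost edit distance used to rank candidate name variants, not the paper's learned transducer or its phylogeny inference. The function names `levenshtein` and `rank_variants` and the example strings are assumptions made for this sketch.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def rank_variants(query: str, collection: List[str]) -> List[str]:
    """Order the collection by increasing edit distance to the query."""
    return sorted(collection, key=lambda s: levenshtein(query, s))


if __name__ == "__main__":
    # Hypothetical name strings, chosen only for illustration.
    names = ["Hillary Clinton", "William Clinton", "Bill Clinton"]
    for name in rank_variants("Bill Clinton", names):
        print(levenshtein("Bill Clinton", name), name)
```

A fixed distance like this assigns the same cost to every edit. The abstract's claim is that a transducer trained on the collection itself improves over such untrained distances.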
dblp:conf/emnlp/AndrewsED12 fatcat:rlscp5wpgvb3zjcogooobi4vbu