Embedding the Ulam metric into ℓ1

Moses Charikar, Robert Krauthgamer
2006 Theory of Computing  
Edit distance is a fundamental measure of distance between strings, the extensive study of which has recently focused on computational problems such as nearest neighbor search, sketching and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., ℓ1) with small distortion, and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into ℓ1 has received a lot of attention lately, there is a large gap between known upper and lower bounds. We make progress on this question by considering large, well-structured submetrics of the edit distance metric space. Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most n, embeds into ℓ1 with distortion O(log n). This immediately leads to sketching algorithms with constant size sketches, and to efficient approximate nearest neighbor search algorithms, with approximation factor O(log n). The embedding and its algorithmic consequences present a big improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close by) substrings are distinct.

1 Consider e.g. the permutations P for which {P(2i − 1), P(2i)} = {2i − 1, 2i} for all i = 1, ..., n/2.
2 This metric space contains more than 2^n points, and thus our distortion bound beats by far the one that follows from Bourgain's embedding theorem [4] for general finite metrics. Note that in the nearest neighbor search setting, one needs to embed not only S into ℓ1, but also the (yet unknown) query point.
3 Namely, every permutation is mapped to a subset of some fixed ground set U, such that the edit distance between two permutations is approximately the size of the intersection between the two respective subsets.
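
To make the object of study concrete, the following is a minimal sketch of how the edit distance between two permutations can be computed exactly. It is not the embedding from the paper. It assumes unit-cost insertions and deletions only; a definition that also allows substitutions or character moves would give a value that differs by at most a constant factor. The function names `lcs_of_permutations` and `indel_distance` are hypothetical and chosen for illustration. Because no symbol repeats, the longest common subsequence reduces to a longest increasing subsequence, which takes O(n log n) time.

```python
from bisect import bisect_left


def lcs_of_permutations(p, q):
    """Length of the longest common subsequence of two sequences
    whose symbols are all distinct (no repeats within p or within q)."""
    # Position of each symbol in q.
    pos = {sym: i for i, sym in enumerate(q)}
    # Rewrite p as positions in q, skipping symbols that q does not contain.
    seq = [pos[s] for s in p if s in pos]
    # Patience-sorting style longest increasing subsequence:
    # tails[k] is the smallest last value of an increasing run of length k + 1.
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def indel_distance(p, q):
    """Edit distance with unit-cost insertions and deletions only
    (an assumption of this sketch, not necessarily the paper's definition)."""
    return len(p) + len(q) - 2 * lcs_of_permutations(p, q)


if __name__ == "__main__":
    # Moving the symbol 1 to the end costs one deletion and one insertion.
    print(indel_distance([1, 2, 3, 4, 5], [2, 3, 4, 5, 1]))
```

The reduction to a longest increasing subsequence relies on the symbols being distinct, which is the property that separates permutations from general strings in this setting.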
doi:10.4086/toc.2006.v002a011 dblp:journals/toc/CharikarK06 fatcat:lvsk4ba6mrcr7airm5lcrigham