Similarity evaluation on tree-structured data

Rui Yang, Panos Kalnis, Anthony K. H. Tung
2005 Proceedings of the 2005 ACM SIGMOD international conference on Management of data - SIGMOD '05  
Tree-structured data are becoming ubiquitous nowadays, and manipulating them based on similarity is essential for many applications. The generally accepted similarity measure for trees is the edit distance. Although similarity search has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the tree edit distance. In this paper, we propose to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information. We prove that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees. Based on the theoretical analysis, we describe a novel algorithm which embeds the proposed distance into a filter-and-refine framework to process similarity search on tree-structured data. The experimental results show that our algorithm dramatically reduces the distance computation cost. Our method is especially suitable for accelerating similarity query processing on large trees in massive datasets.
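The filter-and-refine idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the embedding, so the sketch assumes a simple label-count vector as a stand-in. With unit-cost insert, delete, and relabel operations, each edit changes that vector's L1 norm by at most 2, so half the L1 distance (rounded up) is a valid lower bound for the edit distance. All names (`label_vector`, `lower_bound`, `range_query`) and the toy trees are hypothetical, and the exact distance uses a plain memoized recursion for ordered trees that is only practical for small inputs.

```python
from collections import Counter
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)
    if not f2:
        return size(f1)
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + c1, f2) + 1,  # delete rightmost root of f1
        forest_dist(f1, f2[:-1] + c2) + 1,  # insert rightmost root of f2
        # match the two rightmost roots (relabel if labels differ)
        forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1]) + (l1 != l2),
    )

def edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))

def label_vector(tree):
    """Assumed stand-in embedding: count of each node label, in linear time."""
    counts = Counter()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        counts[label] += 1
        stack.extend(children)
    return counts

def lower_bound(v1, v2):
    """ceil(L1 / 2): one edit changes the label counts by at most 2 in L1."""
    l1 = sum(abs(v1[k] - v2[k]) for k in v1.keys() | v2.keys())
    return (l1 + 1) // 2

def range_query(query, dataset, tau):
    """Return trees within edit distance tau, plus the number of exact computations."""
    qv = label_vector(query)
    # Filter: the cheap vector distance discards trees that cannot qualify.
    candidates = [t for t in dataset if lower_bound(qv, label_vector(t)) <= tau]
    # Refine: the expensive exact distance runs only on the survivors.
    results = [t for t in candidates if edit_distance(query, t) <= tau]
    return results, len(candidates)

# Toy data (hypothetical).
q = ("a", (("b", ()), ("c", ())))
t1 = ("a", (("b", ()), ("d", ())))
t2 = ("x", (("y", ()), ("z", ()), ("w", ())))
t3 = ("a", (("c", ()), ("b", ())))
dataset = [t1, t2, t3]
results, exact_calls = range_query(q, dataset, tau=1)
```

In this toy run, `t2` is pruned by the filter alone, while `t3` passes the filter (same labels, different order) and is rejected only in the refine step. Because the filter uses a true lower bound, it never discards a tree that belongs in the answer.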
doi:10.1145/1066157.1066243 dblp:conf/sigmod/YangKT05 fatcat:oy3575qbtrb5hpyq6m3ekk2uga