Scaling link-based similarity search

Dániel Fogaras, Balázs Rácz
2005 Proceedings of the 14th international conference on World Wide Web - WWW '05  
To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices is numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints; at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford WebBase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provides more adequate information for similarity search than single-step neighborhoods.
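The precompute-then-query idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard random-surfer reading of SimRank (the expected value of c^τ, where τ is the first step at which two reverse random walks meet), and the function names, the `decay` value of 0.6, and the truncation at `path_length` steps are all hypothetical choices made for illustration.

```python
import random

def build_fingerprints(in_neighbors, num_samples, path_length, seed=0):
    """Precompute the index: for each sample and each step, every vertex
    draws one random in-neighbor. Sharing the draw between walks means
    that two walks that meet stay together afterwards."""
    rng = random.Random(seed)
    index = []
    for _ in range(num_samples):
        steps = []
        for _ in range(path_length):
            # Vertices without in-links are left out; a walk reaching them stops.
            steps.append({v: rng.choice(nbrs)
                          for v, nbrs in in_neighbors.items() if nbrs})
        index.append(steps)
    return index

def estimate_simrank(index, u, v, decay=0.6):
    """Query time: average decay**t over samples, where t is the first
    step at which the reverse walks from u and v meet (0 if they never
    meet within the truncated path length)."""
    if u == v:
        return 1.0
    total = 0.0
    for steps in index:
        a, b = u, v
        for t, step in enumerate(steps, start=1):
            a, b = step.get(a), step.get(b)
            if a is None or b is None:
                break  # one walk hit a vertex with no in-links
            if a == b:
                total += decay ** t
                break
    return total / len(index)

# Toy graph (assumed for illustration): pages "a" and "b" are both linked only from "c".
in_neighbors = {"a": ["c"], "b": ["c"], "c": []}
index = build_fingerprints(in_neighbors, num_samples=100, path_length=5)
score_ab = estimate_simrank(index, "a", "b")
score_ac = estimate_simrank(index, "a", "c")
```

In this toy graph the walks from "a" and "b" always meet after one step, so the estimate equals the decay factor, while "a" and "c" never meet and score zero. On real graphs the estimate is noisy and its accuracy depends on the number of samples.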
doi:10.1145/1060745.1060839 dblp:conf/www/FogarasR05 fatcat:4i2242pw2rd53lrsr4q7i3l5wm