Scalable similarity search for SimRank
Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD '14
SimRank, proposed by Jeh and Widom, provides a good similarity score and has been successfully used in many of the above-mentioned applications. Although many algorithms have been proposed to compute SimRank, none of them scales to billion-scale graphs. Motivated by this fact, we consider the following SimRank-based similarity search problem: given a query vertex u, find the top-k vertices v with the k highest SimRank scores s(u, v) with respect to u. We propose a very fast and scalable algorithm for this similarity search problem. Our method consists of the following ingredients:
1. We first introduce a "linear" recursive formula for SimRank. This allows us to formulate the problem in a form for which we can design a very fast algorithm.
2. We establish a Monte-Carlo based algorithm to compute a single-pair SimRank score s(u, v), which is based on the random-walk interpretation of our linear recursive formula.
3. We empirically show that the SimRank score s(u, v) decreases rapidly as the distance d(u, v) increases. Therefore, to compute SimRank scores for a query vertex u in our similarity search problem, we only need to look at a very "local" area around u.
4. We combine two upper bounds for the SimRank score s(u, v) (which can be obtained by Monte-Carlo simulation in our preprocess) with an adaptive sampling technique to prune the similarity search procedure. This results in a much faster algorithm.
Once our preprocess is done (which only takes O(n) time), our algorithm finds, given a query vertex u, the top-20 similar vertices v with the 20 highest SimRank scores s(u, v) in less than a few seconds, even for graphs with billions of edges.
* Supported by JST, ERATO, Kawarabayashi Project.
To the best of our knowledge, this is the first time a SimRank algorithm has scaled to graphs with billions of edges (for the single-source case).
2. Based on this linear recursive formula, the single-pair SimRank score s(u, v) can be computed very efficiently by Monte-Carlo simulation. Indeed, the time complexity is independent of the size of the network (e.g., n, m).
3. We observe that the SimRank score s(u, v) decays very rapidly as the distance between u and v increases.
4. Based on the above observation, we establish upper bounds on the SimRank score s(u, v) that depend only on the distance d(u, v). The upper bounds can be efficiently computed by Monte-Carlo simulation (in our preprocess). These upper bounds, together with an adaptive sampling technique, allow us to effectively prune the similarity search procedure.
Combining these ingredients, we obtain the following algorithm for the top-k similarity search problem (Problem 1).
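The ingredients above can be illustrated with a small Python sketch. It is not the algorithm of this paper: the linear recursive formula and the precomputed Monte-Carlo bounds are not given in this section, so the sketch substitutes two standard facts about Jeh and Widom's SimRank. First, s(u, v) equals the expectation of c^τ, where τ is the first step at which two random walks that follow in-edges from u and v meet. Second, walks that meet after t steps imply d(u, v) <= 2t in the undirected graph, so s(u, v) <= c^ceil(d(u, v)/2). The names `simrank_mc` and `top_k_similar`, the decay factor c = 0.6, and the sample and step counts are all assumptions made for illustration.

```python
import heapq
import random
from collections import deque


def simrank_mc(in_nbrs, u, v, c=0.6, walks=1000, max_steps=10, seed=0):
    """Estimate s(u, v) as the mean of c**tau over paired reverse walks,
    where tau is the first step at which the two walks meet.
    Walks are truncated at max_steps, so the cost is at most
    walks * max_steps samples, regardless of graph size."""
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        a, b = u, v
        for step in range(1, max_steps + 1):
            na, nb = in_nbrs.get(a), in_nbrs.get(b)
            if not na or not nb:
                break  # a walk reached a vertex with no in-neighbors
            a, b = rng.choice(na), rng.choice(nb)
            if a == b:
                total += c ** step
                break
    return total / walks


def top_k_similar(in_nbrs, u, k, c=0.6, **mc_args):
    """Search outward from u in order of distance and stop as soon as the
    distance bound c**ceil(d/2) cannot beat the current k-th best score."""
    if k <= 0:
        return []
    # Undirected adjacency, used only to measure the distance d(u, x).
    adj = {}
    for x, preds in in_nbrs.items():
        for p in preds:
            adj.setdefault(x, set()).add(p)
            adj.setdefault(p, set()).add(x)
    best = []  # min-heap of (score, vertex); best[0] is the k-th best so far
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        x, d = queue.popleft()
        # BFS visits vertices in nondecreasing distance, so once the bound
        # fails for x it fails for every vertex still in the queue.
        if len(best) == k and c ** ((d + 1) // 2) <= best[0][0]:
            break
        if x != u:
            score = simrank_mc(in_nbrs, u, x, c=c, **mc_args)
            if len(best) < k:
                heapq.heappush(best, (score, x))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, x))
        for y in sorted(adj.get(x, ())):
            if y not in seen:
                seen.add(y)
                queue.append((y, d + 1))
    return sorted(best, key=lambda item: -item[0])
```

Here `in_nbrs` maps each vertex to the list of its in-neighbors. The distance bound used for pruning is the crude one stated above; the bounds described in item 4 are estimated in a preprocessing step and would presumably prune far more aggressively.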