Dynamic personalized pagerank in entity-relation graphs
Proceedings of the 16th international conference on World Wide Web - WWW '07
Extractors and taggers turn unstructured text into entityrelation (ER) graphs where nodes are entities (email, paper, person, conference, company) and edges are relations (wrote, cited, works-for). Typed proximity search of the form type=person NEAR company∼"IBM", paper∼"XML" is an increasingly useful search paradigm in ER graphs. Proximity search implementations either perform a Pagerank-like computation at query time, which is slow, or precompute, store and combine per-word Pageranks, which
... Pageranks, which can be very expensive in terms of preprocessing time and space. We present HubRank, a new system for fast, dynamic, spaceefficient proximity searches in ER graphs. During preprocessing, HubRank computes and indexes certain "sketchy" random walk fingerprints for a small fraction of nodes, carefully chosen using query log statistics. At query time, a small "active" subgraph is identified, bordered by nodes with indexed fingerprints. These fingerprints are adaptively loaded to various resolutions to form approximate personalized Pagerank vectors (PPVs). PPVs at remaining active nodes are now computed iteratively. We report on experiments with CiteSeer's ER graph and millions of real Cite-Seer queries. Some representative numbers follow. On our testbed, HubRank preprocesses and indexes 52 times faster than whole-vocabulary PPV computation. A text index occupies 56 MB. Whole-vocabulary PPVs would consume 102 GB. If PPVs are truncated to 56 MB, precision compared to true Pagerank drops to 0.55; in contrast, HubRank has precision 0.91 at 63 MB. HubRank's average query time is 200-300 milliseconds; query-time Pagerank computation takes 11 seconds on average.