Computing PageRank in a Distributed Internet Search System [chapter]

Yuan Wang, David J. DeWitt
Proceedings of the 2004 VLDB Conference
Existing Internet search engines use web crawlers to download data from the Web. Page quality is measured on central servers, where user queries are also processed. This paper argues that using crawlers has several disadvantages. Most importantly, crawlers do not scale: even Google, the leading search engine, indexes less than 1% of the entire Web.[1][2] This paper proposes a distributed search engine framework in which every web server answers queries over its own data. Results from multiple web servers are merged into a ranked hyperlink list on the submitting server. This paper presents a series of algorithms that compute PageRank in such a framework. Preliminary experiments on a real data set demonstrate that the system achieves accuracy on PageRank vectors comparable to Google's well-known PageRank algorithm and, therefore, high quality of query results.

[1] 167 TB in the surface web, 91,850 TB in the deep web, 18.7 KB per page [19].
[2] Claimed on http://www.google.com as of June 2004.
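For context, the distributed algorithms in the paper approximate the classic centralized PageRank computation. A minimal sketch of that baseline power iteration is below; the tiny link graph, damping factor, and iteration count are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of the centralized PageRank power iteration that the
# paper's distributed algorithms aim to approximate. The graph, damping
# factor d=0.85, and iteration count are illustrative choices only.

def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iters):
        # every page receives the teleportation term (1 - d) / n
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # a page divides its rank evenly among its out-links
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page: spread its rank uniformly over all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy three-page web: a -> {b, c}, b -> {c}, c -> {a}
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(web)
```

The ranks form a probability distribution (they sum to 1), and in this toy graph page `c`, which collects links from both `a` and `b`, ends up ranked highest. The paper's contribution is computing such vectors without a central crawl, with each web server holding only its local portion of the link graph.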
doi:10.1016/b978-012088469-8.50039-5 dblp:conf/vldb/WangD04 fatcat:stytw3tawrcrlm3d6fln6sukqy