Topic-oriented collaborative crawling

Chiasen Chung, Charles L. A. Clarke
2002 Proceedings of the eleventh international conference on Information and knowledge management - CIKM '02  
A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
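As a rough illustration of the hash-based baseline mentioned in the abstract, the sketch below maps a URL or a page body to one of a fixed number of crawler nodes. The abstract does not specify the hash function or the assignment rule, so the use of SHA-1, the modulo step, and the names `assign_node`, `assign_by_url`, and `assign_by_content` are assumptions made for illustration only.

```python
import hashlib


def assign_node(key: str, num_nodes: int) -> int:
    """Map an arbitrary string key to a node index in [0, num_nodes).

    Assumption: SHA-1 followed by a modulo; the paper's actual hash
    function is not stated in the abstract.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


def assign_by_url(url: str, num_nodes: int) -> int:
    """Hypothetical helper: choose the responsible node from the URL."""
    return assign_node(url, num_nodes)


def assign_by_content(page_text: str, num_nodes: int) -> int:
    """Hypothetical helper: choose the responsible node from page content.

    Identical content reached through different URLs hashes to the same
    node, so that node can recognise the duplicate locally.
    """
    return assign_node(page_text, num_nodes)
```

Under these assumptions the same key always maps to the same node, so a URL discovered by several crawlers is still fetched by only one of them.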
doi:10.1145/584792.584802 dblp:conf/cikm/ChungC02 fatcat:a5aatkrugbdonkjed3aty6exvu