An adaptive crawler for locating hidden-Web entry points

Luciano Barbosa, Juliana Freire
2007 Proceedings of the 16th international conference on World Wide Web - WWW '07  
In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
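
The general idea in the abstract (rank links, follow the most promising first, and update the ranking from what each fetched page turns out to contain) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the learner, so the token-frequency scorer, the class name `AdaptiveFrontier`, and its methods are all hypothetical stand-ins chosen for brevity.

```python
import heapq
from collections import defaultdict


class AdaptiveFrontier:
    """Toy crawl frontier whose link scores are re-learned as the crawl runs.

    Assumption: each link is described by a list of tokens (e.g. words from
    its URL or anchor text). The scoring rule is illustrative only.
    """

    def __init__(self):
        self._heap = []                # entries: (-score, order, url, tokens)
        self._order = 0                # tie-breaker: FIFO among equal scores
        self._hits = defaultdict(int)  # token -> followed links that led to a form
        self._seen = defaultdict(int)  # token -> followed links in total

    def score(self, tokens):
        """Mean smoothed success rate of the tokens (0.5 for unseen tokens)."""
        if not tokens:
            return 0.5
        rates = [(self._hits[t] + 1) / (self._seen[t] + 2) for t in tokens]
        return sum(rates) / len(rates)

    def push(self, url, tokens):
        """Queue a link, ranked by the current score of its tokens."""
        entry = (-self.score(tokens), self._order, url, tuple(tokens))
        heapq.heappush(self._heap, entry)
        self._order += 1

    def pop(self):
        """Return the highest-scoring queued link as (url, tokens)."""
        _, _, url, tokens = heapq.heappop(self._heap)
        return url, tokens

    def feedback(self, tokens, found_form):
        """Online update from one crawl outcome, then re-rank the queue."""
        for t in set(tokens):
            self._seen[t] += 1
            if found_form:
                self._hits[t] += 1
        self._heap = [(-self.score(tk), o, u, tk) for _, o, u, tk in self._heap]
        heapq.heapify(self._heap)
```

In use, a crawler would call `pop()` to choose the next link, fetch the page, call `feedback(tokens, found_form)` with whether a searchable form was found, and `push()` the newly extracted links. A fixed-focus crawler corresponds to never calling `feedback`, so the ranking never changes.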
doi:10.1145/1242572.1242632 dblp:conf/www/BarbosaF07a fatcat:jjwhm5ppojbnbfetxuuvmqae6m