Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts [chapter]

Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries
2016 Lecture Notes in Computer Science  
Web archives preserve the fast changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB ) using a depth-first strategy on manually selected websites from the .nl domain, with the goal to crawl websites as completes
more » ... s possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web, this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage, while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web; both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls are different in terms of their size, number of included websites and domains. To allow fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.
doi:10.1007/978-3-319-43997-6_11 fatcat:odndkvcgzbdpjcq7pzznn6yvda