A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2017; you can also visit the original URL.
The file type is application/pdf.
Learning URL patterns for webpage de-duplication
2010
Proceedings of the third ACM international conference on Web search and data mining - WSDM '10
Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these rules for de-duplication using just URL strings without fetching the content explicitly. Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging
doi:10.1145/1718487.1718535
dblp:conf/wsdm/KoppulaLACGS10
fatcat:2z4jhswpofc6rjxujshkjlnnuq