Learning URL patterns for webpage de-duplication

Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar
2010 Proceedings of the third ACM international conference on Web search and data mining - WSDM '10  
Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these rules for de-duplication using just URL strings without fetching the content explicitly. Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster. Preserving each mined rule for de-duplication is not efficient due to the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint to be usable at web-scale. The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves 2 times more reduction in duplicates with only half the rules compared to the most recent previous approach. Scalability of the framework is demonstrated by performing a large scale evaluation on a set of 3 Billion URLs, implemented using the MapReduce framework.

Stage II: Associates deep tokenized key-value pairs to the original URL and constructs deep tokenized URLs (URL_dt).
Map: Host Host, URL_dt

Pair-wise Rule Generation: Generates pair-wise Rules from URL pairs of a duplicate cluster. dupC stands for dup cluster id, and source rank and target rank stand for source and target selection rank. c and t stand for context and transformation of the pair-wise Rule.

Rule Generalization
Stage I: Generates frequency for each <c_key_i, c_val_ij, t>, where c_key_i, c_val_ij represents a key-value pair of a Rule context.
Map: c, t → Host, t, {<c_key_i, c_val_ij>}
Reduce: Host, t, {<c_key_i, c_val_ij>} → {<Host, t, c_key_i, c_val_ij, freq>}
Stage II: Generalizes contexts using Algorithm 5. cgen stands for generalized context.
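The pair-wise Rule generation and the Stage I frequency count described above can be illustrated with a small sketch. This is not the authors' implementation: the tokenizer is a simplified stand-in for deep tokenization, and the function names (tokenize, pairwise_rule, context_frequencies), the key naming scheme ("path0", "q:id") and the example.com URLs are all assumptions made for illustration. The in-memory Counter plays the role that the Map and Reduce steps play at scale.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def tokenize(url):
    """Split a URL into key/value pairs (simplified stand-in for deep tokenization)."""
    parts = urlsplit(url)
    tokens = {"scheme": parts.scheme, "host": parts.netloc}
    # Path segments get positional keys: path0, path1, ...
    for i, seg in enumerate(s for s in parts.path.split("/") if s):
        tokens[f"path{i}"] = seg
    # Query arguments are keyed by their parameter name.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        tokens[f"q:{key}"] = value
    return tokens


def pairwise_rule(source_url, target_url):
    """Return (c, t): the source's key/value context and the changes that yield the target."""
    src, tgt = tokenize(source_url), tokenize(target_url)
    context = tuple(sorted(src.items()))
    # A value of None means "drop this key"; any other value means "set this key".
    transformation = tuple(sorted(
        (key, tgt.get(key))
        for key in src.keys() | tgt.keys()
        if src.get(key) != tgt.get(key)
    ))
    return context, transformation


def context_frequencies(rules):
    """Count how often each (host, t, c_key, c_val) occurs across pair-wise rules."""
    freq = Counter()
    for context, transformation in rules:
        host = dict(context)["host"]
        for c_key, c_val in context:
            freq[(host, transformation, c_key, c_val)] += 1
    return freq


if __name__ == "__main__":
    # Hypothetical duplicate cluster: the session id does not change the page.
    target = "http://example.com/story?id=7"
    sources = [
        "http://example.com/story?id=7&sid=abc",
        "http://example.com/story?id=7&sid=xyz",
    ]
    rules = [pairwise_rule(u, target) for u in sources]
    for key, count in sorted(context_frequencies(rules).items()):
        print(count, key)
```

In this toy cluster both source URLs produce the same transformation (drop the sid argument), while the sid value differs in each context. Counts of this kind are the sort of input a generalization step could use to decide which context keys matter; how the paper's Algorithm 5 actually uses them is not shown here.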
doi:10.1145/1718487.1718535 dblp:conf/wsdm/KoppulaLACGS10 fatcat:2z4jhswpofc6rjxujshkjlnnuq