Automatic Detection of Webpages that Share the Same Web Template

Julián Alarte, David Insa, Josep Silva, Salvador Tamarit
2014 Electronic Proceedings in Theoretical Computer Science  
Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpage development, content extraction, block detection, and webpage indexing. One of the main goals of template extraction is to identify a set of webpages that share the same template without having to load and analyze too many webpages beforehand. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. This set is computed with a hyperlink analysis that yields a very small set with a high level of confidence.
doi:10.4204/eptcs.163.2 fatcat:eceoyul67bbejouztesmu55a2y