Julián Alarte, David Insa, Josep Silva, Salvador Tamarit
2015 Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion  
This paper presents TeMex, a site-level web template extractor. TeMex is fully automatic and works with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed). More importantly, it does not need a predefined set of webpages to perform the analysis: TeMex only needs a URL. Contrary to previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.
doi:10.1145/2740908.2742835 dblp:conf/www/AlarteIST15 fatcat:3eqfrewcgbghbmd2o2bjqj4erm