Automatic identification of informative sections of Web pages

S. Debnath, P. Mitra, N. Pal, C.L. Giles
2005 IEEE Transactions on Knowledge and Data Engineering  
Web-pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end-users search for the primary content and largely do not seek the non-informative content. A tool that assists an end-user or application to search and process information from Web-pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web-page blocks" or just "blocks". First, a tool must segment the Web-pages into Web-page blocks; second, the tool must separate the primary content blocks from the non-informative content blocks. In this paper, we formally define Web-page blocks and devise a new algorithm to partition an HTML page into constituent Web-page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by (i) looking for blocks that do not occur a large number of times across Web-pages, (ii) looking for blocks with desired features, and (iii) using classifiers trained with block-features, respectively. Operating on several thousand Web-pages obtained from various Websites, our algorithms outperform several existing algorithms with respect to runtime and accuracy. Furthermore, we show that a Web-cache system that applies our algorithms to remove non-informative content blocks and to identify similar blocks across Web-pages can achieve significant storage savings.
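Strategy (i) above, keeping blocks that do not occur a large number of times across Web-pages, can be illustrated with a minimal sketch. This is not the paper's ContentExtractor implementation: it assumes the pages have already been segmented into blocks represented as plain strings, that two blocks are "the same" only when their text matches exactly, and that a hypothetical `max_fraction` threshold decides what counts as occurring too often. The function name `primary_blocks` is likewise an illustrative choice.

```python
from collections import Counter


def primary_blocks(pages, max_fraction=0.5):
    """Keep, for each page, the blocks that appear on few pages overall.

    pages: list of pages, each a list of block strings (assumed to be
           pre-segmented; segmentation itself is not shown here).
    max_fraction: hypothetical threshold; a block found on more than this
                  fraction of the pages is treated as non-informative.
    """
    # Count the number of pages on which each distinct block occurs.
    # set() ensures a block repeated within one page is counted once.
    page_counts = Counter()
    for blocks in pages:
        page_counts.update(set(blocks))

    limit = max_fraction * len(pages)
    # Preserve the original block order within each page.
    return [[b for b in blocks if page_counts[b] <= limit] for blocks in pages]


pages = [
    ["Home | About | Contact", "Story A text", "Copyright notice"],
    ["Home | About | Contact", "Story B text", "Copyright notice"],
    ["Home | About | Contact", "Story C text", "Copyright notice"],
]
result = primary_blocks(pages)
```

In this toy input the navigation bar and copyright notice occur on all three pages and are dropped, while each story block occurs once and is kept. Exact string matching is the simplest possible notion of block similarity; a practical system would likely need a more tolerant comparison.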
doi:10.1109/tkde.2005.138 fatcat:2iataz2htvfjbkurg2sqpdkmxi