Adaptive web-page content identification

John Gibson, Ben Wellner, Susan Lubar
2007 Proceedings of the 9th annual ACM international workshop on Web information and data management - WIDM '07  
Identifying which parts of a Web page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web pages between 80% and 97% of the time, depending on experimental factors such as whether duplicate documents are removed and whether the model is applied to unseen sources.
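The sequence-labeling framing can be illustrated with a minimal sketch. It is not the authors' model: the two features (word count and link ratio), all weights, the block format, and the sample page are hypothetical and chosen only for illustration. A trained Conditional Random Field would learn its weights from labeled pages. The sketch covers only the decoding step, comparing Viterbi decoding over a page's blocks with labeling each block on its own.

```python
LABELS = ("other", "content")

# Hypothetical transition scores: neighbouring blocks tend to share a label.
TRANSITION = {
    ("other", "other"): 1.0,
    ("content", "content"): 1.0,
    ("other", "content"): -1.0,
    ("content", "other"): -1.0,
}


def emission_score(block, label):
    """Score one block against one label using two toy, hand-set features."""
    words = len(block["text"].split())
    link_ratio = block["link_words"] / max(words, 1)
    # Long blocks with few linked words look like article text.
    score = 0.2 * words - 4.0 * link_ratio - 1.5
    return score if label == "content" else -score


def label_independently(blocks):
    """Baseline: pick the best label for each block, ignoring its neighbours."""
    return [max(LABELS, key=lambda lab: emission_score(b, lab)) for b in blocks]


def viterbi(blocks):
    """Return the highest-scoring label sequence for the whole page."""
    if not blocks:
        return []
    scores = [{lab: emission_score(blocks[0], lab) for lab in LABELS}]
    backpointers = []
    for block in blocks[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for lab in LABELS:
            best_prev = max(LABELS, key=lambda p: prev[p] + TRANSITION[(p, lab)])
            cur[lab] = (prev[best_prev] + TRANSITION[(best_prev, lab)]
                        + emission_score(block, lab))
            ptr[lab] = best_prev
        scores.append(cur)
        backpointers.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda lab: scores[-1][lab])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Made-up page: navigation bar, three article paragraphs, footer links.
page = [
    {"text": "Home News Sports Weather", "link_words": 4},
    {"text": "The city council voted on Tuesday to approve a new budget "
             "for road repairs across the region", "link_words": 0},
    {"text": "He declined to comment.", "link_words": 0},
    {"text": "The plan takes effect next month and will be reviewed by "
             "the council after one year", "link_words": 0},
    {"text": "Contact Privacy Terms", "link_words": 3},
]

independent_labels = label_independently(page)
sequence_labels = viterbi(page)
print(independent_labels)
print(sequence_labels)
```

In this toy page the short middle paragraph is labeled "other" when scored alone, but "content" when its neighbours are taken into account. This is the kind of contextual decision a sequence model can make and a per-block rule cannot.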
doi:10.1145/1316902.1316920 dblp:conf/widm/GibsonWL07 fatcat:763ccxjdmfab5lvktlkaz374ku