The file type is application/pdf.
Adaptive web-page content identification
2007
Proceedings of the 9th annual ACM international workshop on Web information and data management - WIDM '07
Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly
doi:10.1145/1316902.1316920
dblp:conf/widm/GibsonWL07
fatcat:763ccxjdmfab5lvktlkaz374ku