Automatic Extraction of Complex Web Data

Ming Zhang, Ying Zhou, Jon Patrick
2006 Pacific Asia Conference on Information Systems  
This paper presents WTM, a new wrapper induction algorithm that generates rules describing the general layout template of a web page. WTM is designed mainly for use in weblog crawling and indexing systems. Most weblogs are maintained by content management systems and share a similar layout structure across all pages. In addition, they provide RSS feeds that describe the latest entries, and these entries also appear on the weblog homepage in HTML format. WTM builds on these two observations. It uses RSS feed data to automatically label the corresponding HTML file (the weblog homepage) and induces general template rules from the labeled page. The rules can then be used to extract data from other pages with a similar layout template. WTM was tested on a selection of weblogs, and the results are satisfactory.
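
The labeling-then-induction idea can be illustrated with a minimal sketch. This is not the WTM algorithm itself, whose rule language the abstract does not specify. It assumes that a "rule" is simply the tag path (with class names) under which RSS entry titles most often appear on the homepage. All function and class names here (`PathCollector`, `induce_rule`, `extract`) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag; they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Records (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def rss_titles(rss_xml):
    """Entry titles from the feed (assumes plain RSS 2.0 <item><title>)."""
    root = ET.fromstring(rss_xml)
    return {t.text.strip() for t in root.findall(".//item/title") if t.text}


def induce_rule(rss_xml, homepage_html):
    """Label homepage text nodes that match RSS titles; return the
    most frequent tag path among them (assumed rule form)."""
    titles = rss_titles(rss_xml)
    paths = Counter(p for p, text in text_nodes(homepage_html) if text in titles)
    return paths.most_common(1)[0][0] if paths else None


def extract(rule, html):
    """Apply an induced path rule to another page with the same template."""
    return [text for path, text in text_nodes(html) if path == rule]
```

In this sketch, `induce_rule` would be called once with the feed and the homepage, and the returned path passed to `extract` for archive pages that share the template. Exact string matching between feed titles and page text is a simplifying assumption; real pages may require fuzzier matching.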
dblp:conf/pacis/ZhangZP06 fatcat:lspboso6ijfapnseltsffdzehu