Adaptive record extraction from web pages

Justin Park, Denilson Barbosa
2007 Proceedings of the 16th international conference on World Wide Web - WWW '07  
We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering to obtain data extraction patterns. We compare our method experimentally to the state of the art and show that our approach is very competitive for rigidly structured records (such as product descriptions) and far superior for loosely structured records (such as entries on blogs).
doi:10.1145/1242572.1242838 dblp:conf/www/ParkB07 fatcat:iwlnf6pq7ndn7ffzjacewjhc6u
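
The abstract names two ingredients, a weighted tree matching metric and clustering, without giving details. The following is a minimal illustrative sketch of how such a combination could look, not the authors' algorithm. It assumes a depth-decayed variant of the classic Simple Tree Matching recurrence as the metric and a greedy threshold grouping as the clustering step. The names `Node`, `match`, `similarity`, and `cluster`, the `decay` weighting, and the `threshold` value are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: list = field(default_factory=list)


def size(node, depth=0, decay=0.5):
    """Weighted size of a subtree: a node at depth d contributes decay**d."""
    return decay ** depth + sum(size(c, depth + 1, decay) for c in node.children)


def match(a, b, depth=0, decay=0.5):
    """Depth-weighted Simple Tree Matching score (assumed weighting scheme).

    Roots must share a tag; children are aligned in order with a dynamic
    program, and matches deeper in the tree count for less.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match(a.children[i - 1], b.children[j - 1], depth + 1, decay)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return decay ** depth + dp[m][n]


def similarity(a, b, decay=0.5):
    """Normalize the match score to [0, 1] by the weighted sizes of both trees."""
    return 2 * match(a, b, decay=decay) / (size(a, decay=decay) + size(b, decay=decay))


def cluster(trees, threshold=0.7, decay=0.5):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []
    for tree in trees:
        for group in clusters:
            if similarity(tree, group[0], decay) >= threshold:
                group.append(tree)
                break
        else:
            clusters.append([tree])
    return clusters


# Toy candidates: two rigid product entries and two blog entries of varying shape.
product_a = Node("div", [Node("img"), Node("span"), Node("a")])
product_b = Node("div", [Node("img"), Node("span"), Node("a")])
blog_a = Node("div", [Node("h2"), Node("p"), Node("p")])
blog_b = Node("div", [Node("h2"), Node("p")])

groups = cluster([product_a, blog_a, product_b, blog_b])
print([[n.tag + ":" + str(len(n.children)) for n in g] for g in groups])
```

In this toy run the two product subtrees fall into one group and the two blog subtrees into another, even though the blog entries differ in their number of paragraphs. Each resulting group could then serve as the basis for an extraction pattern; how the paper actually derives patterns from clusters is not stated in the abstract.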