Towards Automatic Structured Web Data Extraction System

Tomas Grigalis
2012 International Baltic Conference on Databases and Information Systems  
Automatic extraction of structured data from web pages is one of the key challenges Web search engines face in advancing to a more expressive semantic level. Here we propose a novel data extraction method called ClustVX. It exploits both visual and structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to derive data extraction rules. The preliminary evaluation results of the ClustVX system on ... e public benchmark datasets demonstrate high efficiency and indicate the need for a much larger, up-to-date benchmark dataset that reflects contemporary Web 2.0 pages.
dblp:conf/balt/Grigalis12 fatcat:ihofkulkvrdlletzujuwerpx3m