Web data extraction based on structural similarity

Zhao Li, Wee Keong Ng, Aixin Sun
2005 Knowledge and Information Systems  
Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g. training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schema, it can be easily incorporated into existing systems as the fabric for integration.
doi:10.1007/s10115-004-0188-z fatcat:vyqafjj67re57bxiciehhxv2cq