Data extraction from Web data sources
2004
Proceedings. 15th International Workshop on Database and Expert Systems Applications, 2004.
This paper explains the basic data structures used in a new page-analysis technique for creating wrappers (data extractors) for the result pages that web sites produce in response to user queries submitted through web page forms. The key structure, called a tpGrid, is a representation of the web page that is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.
doi:10.1109/dexa.2004.1333487
dblp:conf/dexaw/Robinson04
fatcat:t7od7klqy5bklmfvmwmqnnfswe