Data extraction from Web data sources

J. Robinson
2004 Proceedings. 15th International Workshop on Database and Expert Systems Applications, 2004.  
This paper explains the basic data structures used in a new page analysis technique for creating wrappers (data extractors) for the result pages that web sites produce in response to user queries submitted via web page forms. The key structure, called a tpGrid, is a representation of the web page that is easier to analyse than the raw HTML code. The analysis looks for repetition patterns of sets of tagSets, which are defined in the paper.
doi:10.1109/dexa.2004.1333487 dblp:conf/dexaw/Robinson04 fatcat:t7od7klqy5bklmfvmwmqnnfswe
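
The abstract does not define the tpGrid or tagSet structures, so the sketch below does not reproduce them. It only illustrates, under simplifying assumptions, the general idea of looking for repeated tag patterns in a result page: the page is reduced to its sequence of opening tag names, and contiguous runs that recur are reported. The names `TagCollector`, `tag_sequence`, and `repeated_tag_runs`, along with the sample page, are hypothetical and chosen for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the sequence of opening tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the opening tag names of an HTML string in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def repeated_tag_runs(tags, length, min_count=2):
    """Return contiguous tag runs of the given length seen at least min_count times."""
    windows = Counter(
        tuple(tags[i:i + length]) for i in range(len(tags) - length + 1)
    )
    return {run: n for run, n in windows.items() if n >= min_count}


# Hypothetical result page: three records sharing one tag layout.
page = """
<html><body><h1>Results</h1>
<div><a href="/1">One</a><span>10</span></div>
<div><a href="/2">Two</a><span>20</span></div>
<div><a href="/3">Three</a><span>30</span></div>
</body></html>
"""

tags = tag_sequence(page)
runs = repeated_tag_runs(tags, length=3, min_count=3)
print(runs)
```

In this toy page the run `div, a, span` recurs once per record, which is the kind of regularity a wrapper generator can exploit to locate record boundaries. A flat tag sequence discards nesting and layout information, so it is a much cruder representation than the one the paper describes.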