A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2007; you can also visit the original URL.
The file type is application/pdf
.
Extracting XML data from HTML repositories
2004
There is a vast amount of valuable information in HTML documents, widely distributed across the World Wide Web and across corporate intranets. Unfortunately, HTML is mainly presentation oriented and hard to query. While XML is becoming a standard for online data representation and exchange, there is a huge amount of legacy HTML data containing potentially untapped information. We develop a system to extract desired information (records) from thousands of HTML documents, starting from a small
doi:10.14288/1.0091527
fatcat:pwmphbzu75e7fazaavmvrumh7e