Untangling compound documents on the web
2003
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia - HYPERTEXT '03
Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and
doi:10.1145/900069.900070
fatcat:4qyaw6e7tbdf5ekpyoo5g6ggeq