A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2017; you can also visit the original URL.
The file type is application/pdf.
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification
2011
ACM Transactions on the Web
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page's content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when
doi:10.1145/1993053.1993057
fatcat:vmksgqzywvgwhdtj4jzylrd6qi