Leveraging Social Bookmarks from Partially Tagged Corpus for Improved Web Page Clustering

Anusua Trivedi, Piyush Rai, Hal Daumé, Scott L. Duvall
2012 ACM Transactions on Intelligent Systems and Technology  
Automatic clustering of webpages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, and introducing diversity in search results. Typically, webpage clustering algorithms use only features extracted from the page-text. However, the advent of social-bookmarking websites, such as StumbleUpon.com and Delicious.com, has led to a huge amount of user-generated content, such as the social tag information associated with webpages. In this paper, we present a subspace-based feature extraction approach that leverages social tag information to complement the page-contents of a webpage and extract better features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We then present an extension that makes our approach applicable even if the webpage corpus is only partially tagged, i.e., when social tags are present for only a small number of webpages rather than all of them. We compare our subspace-based approach with a number of baselines that use tag information in various other ways, and show that the subspace-based approach leads to improved performance on the webpage clustering task. We also discuss possible future work, including an active learning extension that can help choose which webpages to get tags for when social tags can be obtained for only a small number of webpages.
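The two-view idea in the abstract can be illustrated with a small sketch. The abstract does not name the subspace method, so linear canonical correlation analysis (CCA) is assumed here as one standard way to find maximally correlated directions. The partial-tagging handling (fit on the tagged pages, then project the text of every page) is likewise an assumption for illustration, not necessarily the authors' extension. The data is synthetic, and `fit_cca`, `kmeans`, and all variable names are hypothetical.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Regularized linear CCA (an assumed stand-in for the paper's method).

    Returns the mean of X, a projection matrix for X, and the top-k
    canonical correlations between the two views.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both views with Cholesky factors: T = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    # Singular values of T are the (regularized) canonical correlations.
    U, s, _ = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :k])  # map directions back to X space
    return mx, Wx, s[:k]

def kmeans(Z, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Synthetic corpus (illustrative only): 200 pages, two hidden topics.
rng = np.random.default_rng(0)
n = 200
topic = np.repeat([0, 1], n // 2)
signal = np.where(topic == 0, -1.0, 1.0)[:, None]
X_text = signal @ np.ones((1, 10)) + rng.normal(size=(n, 10))  # page-text view
Y_tags = signal @ np.ones((1, 5)) + rng.normal(size=(n, 5))    # tag view
tagged = rng.permutation(n)[:80]  # assume only 80 pages carry tags

# Learn the shared subspace from the tagged pages only.
mean_x, Wx, corr = fit_cca(X_text[tagged], Y_tags[tagged], k=1)

# Project the text of every page (tagged or not) and cluster there.
Z = (X_text - mean_x) @ Wx
labels = kmeans(Z, k=2)
```

Any clustering algorithm could replace `kmeans` here, in line with the abstract's remark that the choice of clustering algorithm is free once the subspace is learned. A kernelized or otherwise different subspace learner could be substituted for `fit_cca` without changing the rest of the pipeline.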
doi:10.1145/2337542.2337552 fatcat:zknw75oor5geheog3aymmn6wwm