Clustering of Blogs with Enhanced Semantics

A. K. Singh, R. C. Joshi
2011 International Journal of Computer Applications  
Blogs are among the fastest-growing forms of user-generated content on the internet. They are fast becoming a tool for information dissemination and communication. Blogs provide a platform for information sharing, discussion, and the expression of readers' reactions to a blog post. Clustering blogs greatly simplifies blog searching and browsing by organizing them into groups of similar posts. Blogs are generally organized using tags. In this paper, we study the effect of considering
... ther relevant neighborhood contexts and adding the extracted information to the original tag set carried by the blog. The added semantics are extracted by disambiguating all the synsets of the important terms or key phrases within the blog. This work reports a study of measuring similarity on the enhanced blog features and subsequently grouping all blog articles based on the semantics of the tags they carry. We propose adding the semantics extracted from the title, body, and comments of a blog post to its original tag set when clustering blog documents, and we evaluate the hypothesis that adding extracted semantics from these blog constituents improves cluster quality. The k-means algorithm is used for clustering. The experimental results confirm our hypothesis that adding the semantics yields better clusters. The approach first extracts the relevant features from the target blog corpus, title, and comments. The other senses represented by the relevant keywords are discovered using a general-purpose semantics extractor. All the synsets of the relevant keywords are extracted from WordNet. The extracted keyword senses are then appended to the base tag sets. A semantic similarity measure is used to compute the semantic similarity among the documents, and clusters are obtained from it. The two clustering outputs are compared.
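The pipeline described above (enrich each tag set with keyword senses, measure similarity, cluster with k-means) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `TOY_SYNSETS` is a hypothetical hand-written dictionary standing in for a WordNet lookup, cosine similarity over tag counts stands in for the unspecified semantic similarity measure, and the k-means variant is simplified by seeding the centroids with the first k posts.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup (keyword -> sense labels).
# The paper extracts synsets from WordNet; this toy table is an assumption.
TOY_SYNSETS = {
    "car": ["automobile", "vehicle"],
    "auto": ["automobile", "vehicle"],
    "recipe": ["cooking", "food"],
    "baking": ["cooking", "food"],
}

def enrich_tags(tags, keywords, synsets=TOY_SYNSETS):
    """Append the senses of each extracted keyword to the base tag set."""
    enriched = list(tags)
    for word in keywords:
        enriched.extend(synsets.get(word, []))
    return enriched

def cosine(a, b):
    """Cosine similarity between two Counter vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def kmeans(vectors, k, iters=10):
    """Simplified k-means: assign each vector to its most similar centroid."""
    # Simplification: seed the centroids with the first k vectors.
    centroids = [Counter(v) for v in vectors[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                # Cosine ignores scale, so the sum works as well as the mean.
                centroids[c] = sum(members, Counter())
    return labels

# Each post: (base tags, keywords extracted from title, body, and comments).
posts = [
    (["travel"], ["car"]),
    (["kitchen"], ["recipe"]),
    (["roadtrip"], ["auto"]),
    (["dessert"], ["baking"]),
]
base_vectors = [Counter(tags) for tags, _ in posts]
enriched_vectors = [Counter(enrich_tags(tags, kws)) for tags, kws in posts]
labels = kmeans(enriched_vectors, k=2)
```

In this toy data the base tag sets share no terms, so their pairwise similarity is zero, while the enriched vectors let the two car-related posts and the two cooking-related posts fall into the same clusters. Real data would need an actual WordNet interface and a proper sense-disambiguation step.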
doi:10.5120/2026-2741 fatcat:nfvh4jun2jgyfdbxldzuo3v3ci