Defining and Evaluating Blog Characteristics

Fernando Perez Tellez, David Pinto, John Cardiff, Paolo Rosso
2009 Eighth Mexican International Conference on Artificial Intelligence
The analysis of weblogs has become a popular area of natural language processing. Because of their specific characteristics, such as shortness and the size and nature of their vocabulary, it can be difficult to achieve good results using automated clustering techniques. In particular, blogs can vary considerably, both in length and in breadth of topic. Without a priori knowledge of the nature of a blog, it is difficult to achieve accurate clustering results. In this paper, we present a framework for the assessment of a set of corpus features that will provide us with insight into the nature of a blog corpus from a number of perspectives, including shortness, broadness and class imbalance. This in turn allows us to assess the relative hardness of the clustering task and to identify components that can improve its accuracy. We furthermore present the results of some experiments in which we analyzed the features of two sample blog corpora and compared the results with those of other kinds of short texts.
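As a rough illustration of the kind of corpus features the abstract names (shortness, vocabulary size, class imbalance), the following minimal sketch computes a few simple statistics over a labeled corpus. The function name `corpus_features`, the whitespace tokenization, and the specific formulas are assumptions made for illustration only; the abstract does not define the measures, and these should not be read as the measures used in the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def corpus_features(docs, labels):
    """Illustrative corpus statistics (hypothetical, not the paper's definitions).

    docs   -- list of document strings
    labels -- list of gold class labels, one per document
    """
    # Naive whitespace tokenization; a real study would use a proper tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = {tok for toks in tokenized for tok in toks}
    total_tokens = sum(len(toks) for toks in tokenized)

    # Class imbalance sketched as the spread of class sizes relative to the
    # size each class would have if documents were evenly distributed.
    class_sizes = list(Counter(labels).values())
    expected_size = len(docs) / len(class_sizes)

    return {
        # Shortness proxy: average number of tokens per document.
        "avg_doc_length": mean(len(toks) for toks in tokenized),
        # Vocabulary size: number of distinct tokens in the corpus.
        "vocabulary_size": len(vocabulary),
        # Distinct tokens divided by total tokens.
        "type_token_ratio": len(vocabulary) / total_tokens,
        # 0.0 means perfectly balanced classes; larger means more imbalance.
        "class_imbalance": pstdev(class_sizes) / expected_size,
    }
```

Computing such values for a blog corpus and for another short-text corpus would give a side-by-side view of the kind the abstract describes, although which measures are actually informative for clustering hardness is a question for the paper itself.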
doi:10.1109/micai.2009.21 fatcat:6yp6aeerj5bk7pabulqfc6arcm