Sentiment Analysis on News Comments Based on Supervised Learning Method

Yan Zhao, Suyu Dong, Leixiao Li
2014 International Journal of Multimedia and Ubiquitous Engineering  
Sentiment analysis has become one of the most active research areas in NLP, and many researchers have conducted sentiment analysis on foreign-language documents. By comparison, there are few studies on sentiment classification of Chinese documents, and fewer still on news comments. This paper presents a study of sentiment analysis on news comments. We adopt four feature selection methods (DF, IG, CHI, MI), three feature representations (Presence, TF, TF-IDF) and five learning methods (NB, ME, Winnow, C4.5, SVM) for the sentiment analysis of Chinese news comments. The experimental results indicate that, except for MI, the other three feature selection methods are all suitable for selecting features for news comments, and that CHI performs best on a comprehensive assessment; TF is the best feature weighting method; and ME outperforms the other classifiers for sentiment classification.

(2) Which is the best classifier (Winnow, C4.5, NB, ME and SVM) for the sentiment classification of news comments?
(3) Which feature representation (Presence, TF, TF-IDF) is the best method for classifying news comments?

Related Works

Supervised and unsupervised learning are the main techniques for sentiment classification. In this paper, we apply supervised learning to the sentiment classification of news comments. For this method, the key problems are text vectorization and classifier training. Text vectorization consists of feature extraction and the calculation of feature weights.

High feature dimensionality is the critical problem in sentiment classification. In the feature vector space, many features are useless for sentiment classification or even lower its efficiency and effectiveness. Hence, dimensionality reduction is important, and a good feature selection method is a good way to achieve it. Wrappers and filters are the two kinds of feature selection methods in machine learning [9]. Wrappers are time-consuming, especially for high-dimensional vector spaces, and are therefore not suitable for feature selection in sentiment classification [10]. Filters are frequently used instead: they use an evaluation metric to measure how useful each term is for classification and then extract features accordingly.
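As an illustration of such an evaluation metric, the following is a minimal sketch of filter-style selection with the chi-square (CHI) statistic, using the standard 2x2 contingency formulation. The paper does not give its exact formula or preprocessing, so the function names (`chi_square_scores`, `select_top_k`), the toy English tokens and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by its chi-square statistic with respect to `target`.

    docs   : list of token lists (assumed to be segmented already)
    labels : list of class labels, same length as docs
    target : the class whose association with each term is measured
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # number of documents containing the term (the DF score)
    df_target = Counter()  # number of `target` documents containing the term
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df_all[term] += 1
            if y == target:
                df_target[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_target[term]   # target class, term present
        b = df - a            # other classes, term present
        c = n_target - a      # target class, term absent
        d = n - n_target - b  # other classes, term absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically, an assumption)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, for illustration only.
docs = [
    ["good", "news", "great"],
    ["good", "report"],
    ["bad", "news", "terrible"],
    ["bad", "report"],
]
labels = ["pos", "pos", "neg", "neg"]
scores = chi_square_scores(docs, labels, "pos")
top_terms = select_top_k(scores, 2)
```

Because the statistic is squared, terms that are strongly tied to the negative class score as highly as terms tied to the positive class, while terms spread evenly across both classes (here "news" and "report") score zero and would be filtered out.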
There are many filter methods, such as IG, CHI, DF, MI and OR, and a number of researchers have focused on feature selection. [11] evaluated five feature selection methods for text classification and found that IG and CHI were the most effective; [12] showed that CHI was the best feature selection method for four text categorization classifiers; [13] performed binary classification with SVM and twelve feature selection methods, and the experimental results indicated that a new method, BNS (Bi-Normal Separation), was the best; [14] improved Gini index theory and showed that the novel method was better than other feature selection methods. Compared with text classification, there are fewer studies of feature selection for sentiment classification. [15, 16] reported experimental results on feature selection for sentiment classification: [15] showed that IG outperformed the other feature selection methods (DF, CHI, MI), whereas [16] indicated that DF was the most suitable. A large number of studies have shown that different kinds of documents need different feature selection methods to reach the best sentiment classification accuracy. This paper studies feature selection for the sentiment classification of news comments.

When machine learning is used for sentiment classification, feature weighting is necessary after feature selection. The main feature weighting methods are Presence, TF and TF-IDF. [17] compared Presence and TF as feature representations for sentiment classification of movie reviews, and the results showed that Presence outperformed TF; [18] showed that NB with Presence achieved the top accuracy for sentiment classification of Internet restaurant reviews written in Cantonese, and that SVM with different n-grams needed different feature weighting methods to achieve its best accuracy; [16] applied Boolean weights and various feature selection methods to sentiment classification.
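The three weighting schemes named above can be sketched as follows. This is an illustrative example only: the paper does not state its exact formulas, so raw counts for TF and `log(N / df)` for the IDF factor are assumed here as one common variant, and `weight_documents` and the toy tokens are hypothetical names and data.

```python
import math
from collections import Counter


def weight_documents(docs, vocabulary, scheme):
    """Turn tokenized documents into weight vectors over `vocabulary`.

    scheme: "presence" (1 if the term occurs, else 0),
            "tf"       (raw count of the term in the document),
            "tfidf"    (raw count times log(N / document frequency); assumed variant)
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        row = []
        for term in vocabulary:
            if scheme == "presence":
                weight = 1.0 if tf[term] > 0 else 0.0
            elif scheme == "tf":
                weight = float(tf[term])
            elif scheme == "tfidf":
                weight = tf[term] * math.log(n / df[term]) if df[term] else 0.0
            else:
                raise ValueError(f"unknown weighting scheme: {scheme}")
            row.append(weight)
        vectors.append(row)
    return vectors


# Hypothetical toy data; the vocabulary stands in for the selected features.
docs = [["good", "good", "news"], ["bad", "news"]]
vocabulary = ["good", "bad", "news"]
presence = weight_documents(docs, vocabulary, "presence")
tf = weight_documents(docs, vocabulary, "tf")
tfidf = weight_documents(docs, vocabulary, "tfidf")
```

In this toy corpus "news" occurs in every document, so its TF-IDF weight is zero under the assumed formula, whereas Presence and TF still give it a non-zero weight. Short comments often repeat few words, in which case Presence and TF vectors are nearly identical.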
In this paper, the experiments apply Presence, TF and TF-IDF to the sentiment classification of news comments to find the best method.

The classification technique is also important for sentiment classification. So far, many studies of sentiment classification have used machine learning; Naive Bayes, maximum entropy and support vector machines are often used. [7, 18, 19, 20] applied SVM and NB to sentiment classification of different documents: [20] showed that SVM was better than NB for travel reviews; [18, 19] showed that, compared with NB, SVM was not a universal winner; [7] used more features and showed that the accuracies of SVM and NB were comparable. [17] compared SVM, NB and ME for sentiment classification of movie reviews, and the experimental results showed that SVM was the best classifier. Fewer studies have focused on Winnow and C4.5. [6] applied Winnow, PA and LM to sentiment classification of product reviews; [21] adopted five classifiers (centroid, KNN, NB, Winnow and SVM) for product reviews and found that SVM outperformed the others; [22] showed that SVM did not always outperform C4.5. This body of work shows that different kinds of documents need different machine learning techniques to reach the best sentiment classification results. This paper compares the utility of five classifiers (SVM, NB, ME, Winnow, C4.5) for the sentiment classification of news comments.

For the Winnow classifier, CHI is the best feature selection method; C4.5 performs better with DF and CHI than with the other feature selection methods; for the NB classifier, comparing mean accuracies, the highest accuracy is achieved with CHI; and the statistics for ME indicate that ME with IG yields the top accuracy.
The statistics show that the accuracies of SVM, NB and ME with CHI, DF and IG are all above 80%. Looking at the mean accuracy of the four feature selection methods, CHI achieves the highest (82.17%), and IG (80.43%) is slightly higher than DF (80.38%). The gap in accuracy between MI and the other feature selection methods is larger than 16%.

Combining accuracy, positive precision and negative precision in the experimental results for news comments, the classifiers rank in descending order of performance as ME > SVM > NB > Winnow > C4.5. In short, ME is the best classifier for sentiment classification of news comments. Although SVM and NB perform worse than ME, they can still be used for this task. Winnow and C4.5 are not suitable for it.

In conclusion, DF, IG and CHI can all be used for sentiment classification of news comments, and their effects are similar. Hence, when DF, IG or CHI is chosen as the feature selection method, differences in results come from the classifiers themselves. With improved classifiers, the accuracy of sentiment classification can be improved.
doi:10.14257/ijmue.2014.9.7.28 fatcat:52esqrafkbekjkoxhplbyygvla