LeSSA: A Unified Framework based on Lexicons and Semi-Supervised Learning Approaches for Textual Sentiment Classification

Jawad Khan, Young-Koo Lee
2019 Applied Sciences  
Sentiment analysis (SA) is an active research area that aims to classify online unstructured user-generated content (UUGC) into positive and negative classes. Reliable training data is vital for learning a sentiment classifier, but because of domain heterogeneity, manually constructing reliable labeled sentiment corpora is laborious and time-consuming. In the absence of enough labeled data, the alternative use of sentiment lexicons and semi-supervised learning approaches for sentiment classification has attracted substantial attention from the research community. However, state-of-the-art techniques for semi-supervised sentiment classification raise research challenges such as the following. How can the significant information concealed in unstructured data be utilized effectively? How can the model be learned while considering the most effective sentiment features? How can noise and redundant features be removed? How can the initial training data be refined for initial model learning, given that random selection may degrade performance? In addition, most existing lexicons have limited word coverage and may miss key domain-specific sentiment words, so further research is required to improve sentiment lexicons for textual sentiment classification. To address these issues, we propose LeSSA, a novel unified sentiment analysis framework for textual sentiment classification. Our main contributions are fourfold: (a) lexicon construction, generating a high-quality, wide-coverage sentiment lexicon; (b) training classification models on a high-quality training dataset generated by using k-means clustering, active learning, self-learning, and co-training algorithms; (c) classification fusion, whereby the predictions from multiple learners are combined to determine the final sentiment polarity by majority voting; and (d) practicality, that is, we validate our claims by applying our model to benchmark datasets. An empirical evaluation on benchmark datasets from multiple domains demonstrates that the proposed framework outperforms existing semi-supervised learning techniques in terms of classification accuracy.
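The classification fusion step in contribution (c) can be illustrated with a minimal sketch of majority voting over the label predictions of several learners. This is not the authors' implementation: the function name `majority_vote`, the `"pos"`/`"neg"` label strings, and the tie-breaking rule (the label seen first among the tied votes wins) are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-learner label predictions by majority voting.

    `predictions` is a list with one entry per learner; each entry is a
    list of labels, one per document, and all entries have equal length.
    Hypothetical helper for illustration, not taken from the paper.
    """
    if not predictions:
        raise ValueError("at least one learner's predictions are required")
    n_docs = len(predictions[0])
    if any(len(p) != n_docs for p in predictions):
        raise ValueError("all learners must predict the same number of documents")

    fused = []
    # zip(*predictions) groups the votes of all learners for each document.
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count; on a tie,
        # the label that appeared first among the votes is returned.
        label, _count = Counter(votes).most_common(1)[0]
        fused.append(label)
    return fused


# Three hypothetical learners voting on three documents.
learner_outputs = [
    ["pos", "neg", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "pos"],
]
fused_labels = majority_vote(learner_outputs)
```

With an odd number of learners and two classes, as in this example, ties cannot occur, so the tie-breaking assumption does not affect the result.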
doi:10.3390/app9245562 fatcat:adzlvshbmbfklew457auwrh7ue