Constructing and Evaluating a Novel Crowdsourcing-based Paraphrased Opinion Spam Dataset

Seongsoon Kim, Seongwoon Lee, Donghyeon Park, Jaewoo Kang
2017 Proceedings of the 26th International Conference on World Wide Web - WWW '17  
Opinion spam, intentionally written by spammers who do not have actual experience with services or products, has recently become a factor that undermines the credibility of information online. In recent years, studies have attempted to detect opinion spam using machine learning algorithms. However, limitations of gold-standard spam datasets still prove to be a major obstacle in opinion spam research. In this paper, we introduce a novel dataset called Paraphrased OPinion Spam (POPS), which
... s a new type of review spam that imitates real human opinions using crowdsourcing. To create such a seemingly truthful review spam dataset, we asked task participants to paraphrase truthful reviews and to include factual information and domain knowledge in their reviews. The classification experiments and semantic analysis results show that our POPS dataset closely resembles truthful reviews, both linguistically and semantically. We believe that our new deceptive opinion spam dataset will help advance opinion spam research.
doi:10.1145/3038912.3052607 dblp:conf/www/KimLPK17 fatcat:zqh6awgxezasri7epaqh6aojle