A Plagiarism Detection Approach Based on SVM for Persian Texts

Fezeh Esteki, Faramarz Safi Esfahani
2016 Forum for Information Retrieval Evaluation  
Plagiarism is the unauthorized use or adaptation of others' works and ideas without attribution. Numerous methods have been proposed to detect plagiarism in different languages; however, little work has addressed Persian. The present study uses statistical and semantic features to evaluate how well Support Vector Machines (SVMs) detect plagiarism in Persian. To increase accuracy, a stemmer was designed to stem Persian words. The [...] statistical and semantic features were used to train and apply the SVM. The statistical features are the Jaccard coefficient, the Dice coefficient, the Levenshtein distance, and the Longest Common Subsequence. To detect semantic similarities, a new method called "Index Words Replacement" was proposed. The proposed framework was tested on the PAN data set. The results show a precision of 0.93337, a recall of 0.70124, and a Plagdet score of 0.80083.
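As a rough illustration of the four statistical features named in the abstract, the sketch below computes them for a pair of text fragments and also evaluates the Plagdet measure as commonly defined in the PAN evaluations (F1 divided by log2 of one plus granularity). This is not the authors' implementation: the function names, the whitespace tokenization, the length normalization, and the default granularity of 1 are all assumptions made for the example, and the Persian stemmer and the "Index Words Replacement" step are omitted.

```python
from math import log2


def jaccard(a, b):
    """Jaccard coefficient of two token sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice coefficient of two token sets: 2|A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def levenshtein(s, t):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        curr = [i]
        for j, y in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def lcs_length(s, t):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(t) + 1)
    for x in s:
        curr = [0]
        for j, y in enumerate(t, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def features(suspicious, source):
    """Hypothetical feature vector for one fragment pair.

    Assumption: tokens are whitespace-separated (stemming is skipped) and the
    sequence measures are divided by the longer token count.
    """
    s, t = suspicious.split(), source.split()
    longest = max(len(s), len(t)) or 1
    return [
        jaccard(set(s), set(t)),
        dice(set(s), set(t)),
        levenshtein(s, t) / longest,
        lcs_length(s, t) / longest,
    ]


def plagdet(precision, recall, granularity=1.0):
    """Plagdet = F1 / log2(1 + granularity); granularity=1 is an assumption."""
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / log2(1 + granularity)
```

Vectors such as the one returned by `features` could then be passed to any SVM implementation as training input. With the reported precision and recall, `plagdet(0.93337, 0.70124)` evaluates to approximately 0.8008, which is consistent with the reported score if the granularity was close to 1.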
dblp:conf/fire/EstekiE16 fatcat:tlxdq4xxkfcjhcq3hwqisppih4