Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems

Morteza Rezaei Sharifabadi, Seyed Ahmad Eftekhari
2016 Forum for Information Retrieval Evaluation  
In this paper we introduce Mahak Samim, a plagiarism detection corpus of Persian academic texts in which plagiarism cases are embedded. The corpus, which can be used for evaluating plagiarism detection systems, contains more than five thousand artificial plagiarism cases of various lengths and diverse degrees of obfuscation. The development process and the features of the corpus are described here.

CCS Concepts • Information systems ➝ Information retrieval ➝ Retrieval tasks and goals ➝ Near-duplicate and plagiarism detection.
dblp:conf/fire/SharifabadiE16 fatcat:ke3usghg6veq3m24p6wetcuvxy