Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase

Erfaneh Gharavi, Hadi Veisi, Paolo Rosso
2019 Neural computing & applications (Print)  
The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and different types of obfuscation in plagiarism cases. In this paper, we employ text ... vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are considered as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, fixes a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their scale of obfuscation. With the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian and Arabic on the text alignment task, to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach can achieve considerable performance on the different datasets in various languages. Our online threshold tuning approach, without any training dataset, works as well as, or in some cases even better than, the training-based method.
In this paper, we improved our method for text similarity detection applied in the text alignment shared task (Gharavi et al., 2016). This new enhanced approach is employed on different datasets from various languages and obfuscation type perspectives.
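The seed selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional `TOY_VECTORS`, the function names, the use of cosine similarity, the plain average as the aggregation function, the use of Jaccard similarity as a minimum word-overlap filter, and the threshold values `min_cosine` and `min_jaccard` are all assumptions made for the example.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only.
# A real system would load pretrained embeddings for the target language.
TOY_VECTORS = {
    "the": [0.1, 0.0, 0.1],
    "cat": [0.9, 0.1, 0.0],
    "feline": [0.85, 0.15, 0.05],
    "sat": [0.0, 0.8, 0.1],
    "rested": [0.05, 0.75, 0.15],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.2, 0.8],
}
DIM = 3


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence (assumed aggregation)."""
    words = [w for w in sentence.lower().split() if w in TOY_VECTORS]
    if not words:
        return None
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(DIM)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(sentence_a, sentence_b):
    """Jaccard similarity between the word sets of two sentences."""
    set_a, set_b = set(sentence_a.lower().split()), set(sentence_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def find_seeds(suspicious, source, min_cosine=0.9, min_jaccard=0.1):
    """For each suspicious sentence, keep its most similar source sentence
    as a seed if it passes both (assumed) thresholds."""
    seeds = []
    for i, susp_sentence in enumerate(suspicious):
        susp_vec = sentence_vector(susp_sentence)
        if susp_vec is None:
            continue
        best_j, best_sim = None, -1.0
        for j, src_sentence in enumerate(source):
            src_vec = sentence_vector(src_sentence)
            if src_vec is None:
                continue
            sim = cosine(susp_vec, src_vec)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if (best_j is not None and best_sim >= min_cosine
                and jaccard(susp_sentence, source[best_j]) >= min_jaccard):
            seeds.append((i, best_j))
    return seeds


suspicious_doc = ["the feline rested", "stocks fell the"]
source_doc = ["the cat sat", "stocks fell"]
print(find_seeds(suspicious_doc, source_doc))
```

In this toy setting, each sentence is reduced to one fixed-length vector, so comparing two sentences costs a single vector operation regardless of sentence length. The fixed thresholds here stand in for the tuned parameters; the online approach described above would replace them with values derived from the candidates themselves.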
In this task, a pair of documents is given and the possible plagiarism between them must be identified. For efficiency, we apply word vector averaging to represent a sentence. Obfuscation type was also taken into account while comparing the text parts. In the proposed method, we employed two approaches for filtering the detected plagiarism cases. The first approach tuned the required threshold through many trials over a training dataset. We also set the threshold by considering obfuscation type, which improves the total performance of the system. In the second approach, two methods are employed to remove outliers in the filtering phase, which makes threshold tuning on a training dataset unnecessary. These methods automatically tune the threshold with respect to obfuscation type. The main advantages of our approach over others are its simplicity and its fast candidate selection. We accelerated the process by transforming an n-gram-by-n-gram comparison, the so-called string-matching approach, into a numerical one. Syntactic changes in sentences, including alterations in word order, which result in the same sentence representation, can be detected conveniently with our approach. This method can also easily identify transformations in vocabulary, including adding or omitting words, which would be indistinguishable by string-matching approaches. Furthermore, we proposed a new approach for threshold tuning that omits the training phase. The system could be run in real time and with almost no training data. We demonstrated the scalability of our obfuscation-aware, language-independent approach to any new corpus without training data, for different languages on diverse datasets. The rest of this paper is organized as follows: Section 2 presents the related work for the text alignment task. Section 3 describes the text embedding representation. Section 4 defines the proposed method and Section 5 presents the experiments, datasets, evaluation metrics and results. 
Finally, we illuminate the characteristics of our method in the conclusion section.
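The claim that a change in word order leaves the averaged representation unchanged, while an n-gram comparison is affected, can be illustrated with a small sketch. The toy vectors in `TOY_VECTORS`, the function names, the choice of trigrams, and the use of Jaccard overlap as a stand-in for a string-matching comparison are assumptions made for this example; they are not taken from the paper.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only.
TOY_VECTORS = {
    "the": [0.1, 0.0, 0.1],
    "committee": [0.7, 0.2, 0.1],
    "approved": [0.2, 0.9, 0.0],
    "new": [0.3, 0.1, 0.6],
    "budget": [0.5, 0.4, 0.8],
}
DIM = 3


def average_vector(words):
    """Average the word vectors of a tokenized sentence."""
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(DIM)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ngram_jaccard(words_a, words_b, n=3):
    """Jaccard overlap between the word n-gram sets of two sentences
    (an assumed stand-in for a string-matching comparison)."""
    grams_a = {tuple(words_a[i:i + n]) for i in range(len(words_a) - n + 1)}
    grams_b = {tuple(words_b[i:i + n]) for i in range(len(words_b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


original = "the committee approved the new budget".split()
reordered = "the new budget the committee approved".split()

# Averaging ignores word order, so both sentences map to the same vector
# (up to floating-point rounding) and their cosine similarity is 1.
embedding_similarity = cosine(average_vector(original), average_vector(reordered))

# The trigram sets differ, so the n-gram overlap drops well below 1.
trigram_similarity = ngram_jaccard(original, reordered)

print(f"embedding similarity: {embedding_similarity:.3f}")
print(f"trigram overlap:      {trigram_similarity:.3f}")
```

The same property cuts both ways: because averaging discards order, two sentences with the same words but different meanings also receive the same representation, which is one reason a separate filtering step on the candidate seeds is useful.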
doi:10.1007/s00521-019-04594-y fatcat:tilsahe55vdhbl6pajcdtenogi