A Very Fast Algorithm for Detecting Partially Plagiarized Documents Using FM-Index

Chang SeokOck, JongKyu Seo, Sung-Hwan Kim, Hwan-Gue Cho
2013 International Journal of Computer and Communication Engineering  
Sequence alignment and fingerprinting are two of the most common methods for plagiarism detection because of their strong performance. Their disadvantage is that as the size of the target document increases, the string processing cost also increases. To overcome this disadvantage, we use disk-based techniques and the genome assembly techniques used in Next Generation Sequencing (NGS). By combining the two methods, we propose a method for very fast plagiarism detection in a large corpus. The method is based on the Burrows-Wheeler Transform (BWT) and the FM-index for BWT search. For efficient detection, we extract the initial consonants from the Korean corpus and build data structures for indexing the extracted initial consonants. We then split the suspected query document into several pieces and search for each piece in the index. Finally, we analyze the search results to detect the plagiarized sections. Our proposed method achieves a maximum precision of 0.96 and a maximum recall of 1.0. In the future, we plan to investigate ways of improving the search algorithm through optimization, as well as user-specific visualization methods.
Index Terms: Burrows-Wheeler transform, FM-index, plagiarism detection.
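The pipeline described in the abstract (extract initial consonants, index them with a BWT-based FM-index, then search for pieces of the query) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `initial_consonants`, `FMIndex`, and `matched_pieces` are hypothetical, the fixed piece length is an assumption, non-Hangul characters are assumed to be dropped, and the suffix array is built by naive in-memory sorting rather than the disk-based construction the abstract mentions.

```python
from collections import Counter

# The 19 Korean initial consonants, in Unicode syllable-block order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"


def initial_consonants(text):
    """Keep only the initial consonant of each Hangul syllable.

    Assumption: characters outside U+AC00..U+D7A3 are dropped.
    """
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset <= 0xD7A3 - 0xAC00:
            # Each initial consonant covers 21 vowels * 28 finals = 588 syllables.
            out.append(CHOSEONG[offset // 588])
    return "".join(out)


class FMIndex:
    """Toy FM-index supporting occurrence counting via backward search."""

    def __init__(self, text, sentinel="\0"):
        s = text + sentinel
        # Naive suffix array; fine for small examples only.
        sa = sorted(range(len(s)), key=lambda i: s[i:])
        self.bwt = "".join(s[i - 1] for i in sa)
        counts = Counter(s)
        # first[c] = number of characters in s that sort before c.
        self.first = {}
        total = 0
        for c in sorted(counts):
            self.first[c] = total
            total += counts[c]
        # occ[c][i] = number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(s) + 1) for c in counts}
        for i, b in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (b == c)

    def count(self, pattern):
        """Return how many times pattern occurs in the indexed text."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


def matched_pieces(index, query, piece_len):
    """Split query into fixed-length pieces; return (offset, piece) for hits."""
    pieces = [query[i:i + piece_len] for i in range(0, len(query), piece_len)]
    return [(i * piece_len, p) for i, p in enumerate(pieces) if index.count(p) > 0]
```

In this sketch, a corpus would first be passed through `initial_consonants`, the result indexed once with `FMIndex`, and each suspected document reduced the same way before calling `matched_pieces`. Runs of adjacent matching offsets would then be candidates for plagiarized sections; how the authors score and merge such hits is not stated in the abstract.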
doi:10.7763/ijcce.2013.v2.194 fatcat:lvt5msu6gnbz5jnwrlmfr3m4aq