Research on Similarity Detection of Massive Text Based on Semantic Fingerprint

Xiaolin Jin, Shuwu Zhang, Jie Liu, Hu Guan
2018 Proceedings of Information Science and Cloud Computing — PoS(ISCC 2017)   unpublished
To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After preprocessing the Chinese texts, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The texts are then screened initially with the Simhash algorithm, and the resulting candidate texts are finally compared using cosine distance as a second similarity measure to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method greatly improves accuracy and recall when texts have been modified, and it can still meet the requirements of similarity detection on massive texts. The combination of semantic fingerprint and cosine distance therefore effectively compensates for the high false positive rate of the Simhash algorithm and is better suited to similarity detection of massive texts.
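
The two-stage pipeline described in the abstract (TF-IDF feature weights, a coarse Simhash screen, then a cosine re-ranking of the surviving candidates) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes documents are already segmented into word lists (the paper's Chinese preprocessing is not reproduced), and the 64-bit fingerprint, the MD5-based term hash, the smoothed IDF formula, the Hamming threshold of 10, and all function names are hypothetical choices made for the example. The abstract speaks of cosine distance; the sketch ranks by cosine similarity, which gives the same ordering.

```python
import hashlib
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def simhash(weights, bits=64):
    """Build a weighted Simhash fingerprint from a {term: weight} dict."""
    totals = [0.0] * bits
    for term, w in weights.items():
        # Stable 64-bit term hash taken from the first 8 bytes of MD5 (an assumption).
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            totals[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query_idx, docs, max_hamming=10, top_k=3):
    """Stage 1: keep documents whose fingerprint is close to the query's.
    Stage 2: rank those candidates by cosine similarity of their TF-IDF vectors."""
    vectors = tfidf(docs)
    prints = [simhash(v) for v in vectors]
    candidates = [
        i for i in range(len(docs))
        if i != query_idx and hamming(prints[query_idx], prints[i]) <= max_hamming
    ]
    scored = [(i, cosine(vectors[query_idx], vectors[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Calling `most_similar(0, docs)` on a list of tokenized documents returns up to `top_k` pairs of (document index, cosine similarity) for the documents that passed the Simhash screen. A looser `max_hamming` admits more candidates to the cosine stage at higher cost, while a tighter one makes the screen faster but risks dropping modified texts.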
doi:10.22323/1.300.0009 fatcat:4gws6g23sbawbpnwg2xroz2o64