THE INFLUENCE OF TEXT PREPROCESSING METHODS AND TOOLS ON CALCULATING TEXT SIMILARITY

Đorđe Petrović, Milena Stanković
2019 Facta Universitatis Series Mathematics and Informatics  
Text mining depends to a great extent on text preprocessing techniques. The preprocessing methods and tools used to prepare texts for further mining can be divided into those which are language-dependent and those which are not. The subject matter of this research was the analysis of the influence of these methods and tools on further text mining. We first focused on the analysis of the influence on the reduction of the vector space model for the multidimensional representation of text documents. We then analyzed the influence on calculating text similarity, which is the focus of this research. The conclusion we reached is that the implementation of various text preprocessing methods in the Serbian language, which are used for the reduction of the vector space model for the multidimensional representation of text documents, achieves the required results. However, the implementation of various text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.
doi:10.22190/fumi1905973d fatcat:ufhdygfjubfvpng5oioqr3tqie
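
The two effects named in the abstract (a smaller vector space and shifting similarity scores) can be illustrated with a minimal sketch. It is not the authors' method or tooling: the English toy sentences, the stop-word list, the suffix list, and every function name below are assumptions made for illustration, and the crude suffix stripper only stands in for a real language-specific stemmer such as one for Serbian.

```python
import math
import re
from collections import Counter

# Hypothetical toy resources, for illustration only (not from the paper).
STOP_WORDS = {"the", "a", "of", "and", "is", "on", "in"}
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=False, stem=False):
    """Tokenize, then optionally drop stop words and stem the rest."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compare(doc_a, doc_b):
    """Return {pipeline: (joint vocabulary size, cosine similarity)}."""
    pipelines = {
        "raw": {},
        "stop": {"remove_stop_words": True},
        "stop+stem": {"remove_stop_words": True, "stem": True},
    }
    results = {}
    for label, options in pipelines.items():
        tokens_a = preprocess(doc_a, **options)
        tokens_b = preprocess(doc_b, **options)
        vocabulary = set(tokens_a) | set(tokens_b)
        results[label] = (len(vocabulary), cosine_similarity(tokens_a, tokens_b))
    return results


DOC_A = "The cats are chasing the mice in the garden"
DOC_B = "A cat chases a mouse in gardens"

if __name__ == "__main__":
    for label, (size, score) in compare(DOC_A, DOC_B).items():
        print(f"{label:10s} vocabulary={size:2d} similarity={score:.3f}")
```

On this toy pair the joint vocabulary shrinks from 12 terms to 6 as preprocessing steps are added, while the cosine similarity moves from about 0.09 to about 0.67. The numbers depend entirely on the invented resources above, but they show how a pipeline that reduces dimensionality as intended can also change similarity scores substantially.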