VECTOR SPACE MODELS OF KYIV CITY PETITIONS

R.V. Shaptala, G.D. Kyselov
2021 Scientific notes of Taurida National V.I. Vernadsky University. Series: Technical Sciences  
In this study, we explore and compare two ways of creating vector space models of Kyiv city petitions. To analyze freeform texts such as petitions automatically, they must first be converted to a numeric space. By leveraging word vectors based on the distributional hypothesis, namely Word2Vec and FastText, we construct vector models of Kyiv city petitions. The overall pipeline that we contribute consists of training word vectors on the dataset of Kyiv city petitions, preprocessing the documents, ... and applying averaging to create petition vectors. Moreover, this pipeline does not require big data and is applicable to training in a low-resource setting such as the Ukrainian language, for which we used only 4623 unlabeled petitions. No pretrained models or fine-tuning were used in this research, and we provide the hyperparameters that were optimal for the experiments. The advantages and disadvantages of both models are analyzed. The Word2Vec-based model achieves a higher Silhouette Coefficient score and produces denser clusters than the FastText-based one. This makes it more appropriate for real-world applications such as petition sentiment analysis or clustering. Error analysis confirms this result: FastText pays more attention to the syntactic structure of petitions and words, while Word2Vec focuses more on contexts. To support this claim, we show examples of such behavior for the same textual queries on different urban topics. Visualizations of the vector spaces after dimensionality reduction via UMAP are presented to show their overall structure. They reinforce the Silhouette Coefficient scores by exhibiting denser clusters for the Word2Vec-based approach. The resulting models can be used to query semantically related petitions effectively and to search for clusters of related petitions.
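The averaging and querying steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional word vectors, the English tokens, the petition ids and the helper names (petition_vector, cosine, most_similar) are all hypothetical stand-ins, and in practice the word vectors would come from a Word2Vec or FastText model trained on the petitions.

```python
import math

# Hypothetical toy word vectors standing in for trained Word2Vec/FastText
# embeddings; real vectors would be learned from the petition corpus.
WORD_VECTORS = {
    "road":   [0.9, 0.1, 0.0],
    "repair": [0.8, 0.2, 0.1],
    "bus":    [0.7, 0.3, 0.0],
    "park":   [0.1, 0.9, 0.2],
    "trees":  [0.0, 0.8, 0.3],
}

# Hypothetical, already preprocessed (tokenized) petitions.
PETITIONS = {
    "p1": ["repair", "road"],
    "p2": ["park", "trees"],
    "p3": ["bus", "road"],
}


def petition_vector(tokens, word_vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_tokens, petitions, word_vectors):
    """Return petition ids ordered from most to least similar to the query."""
    query_vec = petition_vector(query_tokens, word_vectors)
    if query_vec is None:
        return []
    scored = []
    for pid, tokens in petitions.items():
        vec = petition_vector(tokens, word_vectors)
        if vec is not None:
            scored.append((cosine(query_vec, vec), pid))
    return [pid for _, pid in sorted(scored, reverse=True)]


ranking = most_similar(["road"], PETITIONS, WORD_VECTORS)
print(ranking)
```

Skipping out-of-vocabulary tokens before averaging is one possible design choice for a Word2Vec-style lookup; a FastText-style model can instead compose vectors for unseen words from subword units, so that step would differ there.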
doi:10.32838/2663-5941/2021.4/26 fatcat:pdtvb5snpbhupgil4gd2nrqdpi