Semantic analysis of offensive language categories from existing annotated corpora

Maša Kljun, Matija Teršek, Slavko Žitnik
2022 Uporabna informatika  
There is a vast number of offensive language corpora for English, which differ in annotation criteria and category naming. In this paper, we explore 21 different categories of offensive language. We use natural language processing techniques to find correlations between the categories based on seven different data sets. We employ several traditional (TF-IDF) and advanced (fastText, GloVe, Word2Vec, BERT, and other deep NLP methods) techniques to uncover similarities among different offensive language categories. The findings reveal that most of the categories are densely interconnected and that they can be organized into a two-level hierarchical representation. We also transfer the analysis to the Slovenian language and compare the findings for the two languages.
doi:10.31449/upinf.vol30.num1.151 fatcat:mrrny5ynznhlbek2ezr7g4fa4m