Language-independent Gender Prediction on Twitter

Nikola Ljubešić, Darja Fišer, Tomaž Erjavec
2017 Proceedings of the Second Workshop on NLP and Computational Social Science  
In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users' tweets. We perform our experiments on the TwiSty dataset containing manual gender annotations for users speaking six different languages. Our classification results show that, while the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, it regularly outperforms the bag-of-words model when applied to different languages, showing very stable results across various languages. Finally, we perform a comparative analysis of feature effect sizes across the six languages and show that differences in our features correspond to cultural distances.
doi:10.18653/v1/w17-2901 dblp:conf/acl-nlpcss/LjubesicFE17 fatcat:togkpwez7beodjl63esiuwbdbu