Quantifying Semantics using Complex Network Analysis

Chris Biemann, Stefanie Roos, Karsten Weihe
2012 International Conference on Computational Linguistics  
Though it is generally accepted that language models do not capture all aspects of real language, no adequate measures to quantify their shortcomings have been proposed until now. We use n-gram models as workhorses to demonstrate that the differences between natural and generated language are indeed quantifiable. More specifically, for two algorithmic approaches, we demonstrate that each of them can be used to distinguish real text from generated text accurately and to quantify the distance. We thus obtain a coherent indication of how far a language model is from naturalness. Both methods are based on the analysis of co-occurrence networks: one uses a specific graph clustering measure, the transitivity; the other a specific kind of motif analysis, in which the frequencies of selected motifs are compared. In our study, artificial texts are generated by n-gram models for n = 2, 3, 4. We find that the larger n is chosen, the smaller the distance between generated and natural text becomes. However, even for n = 4, the distance remains large enough to allow an accurate distinction. The motif approach even offers deeper insight into the semantic properties of natural language that evidently cause these differences: polysemy and synonymy. To complete the picture, we show that another motif-based approach by Milo et al. (2004) does not allow such a distinction. Using our method, it becomes possible for the first time to measure the deficiencies of generative language models with regard to the semantics of natural language.
dblp:conf/coling/BiemannRW12 fatcat:nylto3glx5fg5j7gldlw5ypy3e
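
To make the setup described in the abstract concrete, the following Python sketch builds a sentence-level word co-occurrence network, trains a simple bigram generator, and compares the transitivity (global clustering coefficient) of the networks obtained from natural and from generated text. This is an illustrative reconstruction under assumptions (the networkx library, a placeholder corpus file corpus.txt with one sentence per line, whitespace tokenization, unsmoothed bigram sampling), not the authors' actual pipeline.

import random
from collections import defaultdict
from itertools import combinations

import networkx as nx


def cooccurrence_graph(sentences):
    # Connect two word types iff they co-occur in at least one sentence.
    # (A simplified construction; the paper's networks may use additional filtering.)
    g = nx.Graph()
    for sent in sentences:
        for u, v in combinations(set(sent), 2):
            g.add_edge(u, v)
    return g


def train_bigram(sentences):
    # Collect successor lists, including sentence boundary markers.
    succ = defaultdict(list)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            succ[a].append(b)
    return succ


def generate_bigram(succ, n_sentences, seed=0):
    # Sample sentences from the unsmoothed bigram successor distribution.
    rng = random.Random(seed)
    out = []
    for _ in range(n_sentences):
        sent, word = [], "<s>"
        while True:
            word = rng.choice(succ[word])
            if word == "</s>" or len(sent) > 50:
                break
            sent.append(word)
        out.append(sent)
    return out


if __name__ == "__main__":
    # Toy setup: corpus.txt is a placeholder for a natural-language sample;
    # real experiments would require a much larger corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        natural = [line.lower().split() for line in f if line.strip()]
    generated = generate_bigram(train_bigram(natural), len(natural))

    # The abstract's claim in miniature: the transitivity of the co-occurrence
    # network separates natural text from n-gram-generated text.
    print("natural  :", nx.transitivity(cooccurrence_graph(natural)))
    print("generated:", nx.transitivity(cooccurrence_graph(generated)))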