Topic Modeling the Hàn diăn Ancient Classics (汉典古籍)
Colin Allen, Hongliang Luo, Jaimie Murdock, Jianghuai Pu, Xiaohong Wang, Yanjie Zhai, Kun Zhao
2017
Journal of Cultural Analytics
There is a small but growing literature on large-scale statistical modeling of Chinese-language texts. Ouyang analyzed a corpus of over 40,000 ancient documents downloaded from multiple sources, using it to plot the temporal distributions of word frequencies and the geographic distributions of authors. Huang and Yu modeled the SongCi poetry corpus, first converting it to tonally marked pinyin to conserve poetically important pronunciation information. Nichols and colleagues reported initial modeling of the Chinese Text Project corpus in a conference paper. (Further below, we describe differences between this corpus and the Handian.) With additional collaborators, this group has now conducted two studies that are currently unpublished but under review. In the first, they apply topic models to address scholarly questions about the relationships among important texts of ancient Chinese philosophy. In the second, they use topic modeling to investigate the concepts of mind and body in ancient Chinese philosophy. Although we share similar scholarly objectives with these researchers, our approach in this paper is unique in that, for the first time anywhere, we bring the benefits of computational modeling of ancient Chinese texts to a robust public platform that is mirrored on both sides of the Pacific. Besides being a useful portal to the texts, our approach foregrounds the interpretive issues surrounding topic models, and makes more sophisticated exploration and analysis of interpretive questions possible for experts and novices alike.
doi:10.22148/001c.11882
fatcat:qr6evqxewndq3es3aah65v3ze4