Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering

Guoyu Tang, Yunqing Xia, Erik Cambria, Peng Jin, Thomas Fang Zheng
2015 International Journal of Pattern Recognition and Artificial Intelligence
Cross-lingual document clustering is the task of automatically organizing a large collection of multilingual documents into a few clusters according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To address them, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former consists in a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on the benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.
doi:10.1142/s021800141559003x fatcat:eevqzoolcjhlxex523jjm2q74m