Improving document clustering using automated machine translation

Xiang Wang, Buyue Qian, Ian Davidson
2012 Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12  
With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language to many other languages. These translations provide different yet correlated views of the same set of documents. This gives rise to an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering provided positive answers to this question. In this work, we propose an alternative approach to address this problem using the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and can consistently improve the clustering of real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique does not need the compatibility or the conditional independence assumption, nor does it involve subtle parameter tuning.
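The abstract does not spell out either algorithm, so the following is only a toy sketch of the general idea, not the authors' parametric or non-parametric method. It assumes that each document is a small term-count vector in two views (original and machine-translated), that pairwise similarities in the translated view can serve as dense, signed soft constraints, and that a two-way split can be found by brute force. The function names and the weight `beta` are hypothetical and chosen for illustration.

```python
from itertools import product
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def affinity(docs):
    """Pairwise cosine similarity matrix for one view of the documents."""
    return [[cosine(u, v) for v in docs] for u in docs]


def centered(matrix):
    """Subtract the mean off-diagonal value so entries become signed:
    positive suggests 'same cluster', negative suggests 'different cluster'."""
    n = len(matrix)
    off = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off) / len(off)
    return [[matrix[i][j] - mean for j in range(n)] for i in range(n)]


def best_two_way_split(original_view, translated_view, beta=1.0):
    """Brute-force the 2-way labeling that best agrees with the original-view
    affinity plus beta times the translation-derived soft constraints.
    Illustrative only: exponential in the number of documents."""
    w = centered(affinity(original_view))
    q = centered(affinity(translated_view))  # dense, possibly noisy constraints
    n = len(original_view)
    best_labels, best_score = None, float("-inf")
    for rest in product((1, -1), repeat=n - 1):
        labels = (1,) + rest  # fix the first label to skip mirrored solutions
        score = sum(
            (w[i][j] + beta * q[i][j]) * labels[i] * labels[j]
            for i in range(n)
            for j in range(i + 1, n)
        )
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels


# Made-up toy data: four documents, three terms per view.
original = [[3, 1, 0], [2, 1, 1], [0, 1, 3], [1, 1, 2]]
translated = [[2, 0, 0], [3, 1, 0], [0, 1, 2], [0, 0, 3]]
print(best_two_way_split(original, translated))
```

Every document pair receives a signed weight here, which mirrors the "dense" property described in the abstract; because those weights come from an imperfect translation, they are treated as soft preferences scaled by `beta` rather than as hard Must-Link or Cannot-Link rules.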
doi:10.1145/2396761.2396844 dblp:conf/cikm/WangQD12 fatcat:k3idu2evvvexxhpb23gvezvgam