Novel Unsupervised Features for Czech Multi-label Document Classification [chapter]

Tomáš Brychcín, Pavel Král
2014 Lecture Notes in Computer Science  
This paper deals with automatic multi-label document classification in the context of a real application for the Czech News Agency. The main goal of this work is to propose novel, fully unsupervised features based on an unsupervised stemmer, Latent Dirichlet Allocation, and semantic spaces (HAL and COALS). The proposed features are integrated into the document classification task. Another interesting contribution is that these two semantic spaces have never been used in the context of document classification before. The proposed approaches are evaluated on a Czech newspaper corpus. We experimentally show that almost all proposed features significantly improve the document classification score. The corpus is freely available for research purposes.
doi:10.1007/978-3-319-13647-9_8 fatcat:heooyb77lnch7bkfherzecelea