Classifying non-Gaussian and mixed data sets in their natural parameter space

Cecile Levasseur, Uwe F. Mayer, Ken Kreutz-Delgado
2009 IEEE International Workshop on Machine Learning for Signal Processing
We consider the problem of both supervised and unsupervised classification for multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called Generalized Linear Statistics (GLS) is used to capture the underlying statistical structure of these complex data. GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace. Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space are illustrated with examples of categorical-data text categorization and mixed-type data classification. As a text document preprocessing tool, an extension of the conditional mutual information maximization based feature selection algorithm from binary to categorical data is presented.
doi:10.1109/mlsp.2009.5306227 fatcat:jjftmx3ffvhj3ieojh2upflhyq