Model-based clustering of Gaussian copulas for mixed data

Matthieu Marbac (1), Christophe Biernacki (1, 2), Vincent Vandewalle (1)
(1) MODAL - MOdel for Data Analysis and Learning
LPP - Laboratoire Paul Painlevé - UMR 8524, Inria Lille - Nord Europe, CERIM - Santé publique : épidémiologie et qualité des soins-EA 2694, Polytech Lille, Université de Lille 1, IUT’A
Abstract: Clustering mixed data is a challenging problem. In a probabilistic framework, the main difficulty is the shortage of conventional distributions for such data. In this paper, we propose to cluster mixed data with a mixture model of Gaussian copulas, since copulas, and Gaussian ones in particular, are powerful tools for easily modelling the distribution of multivariate variables. Indeed, for a mix of continuous, integer and ordinal variables (all of which have a cumulative distribution function), this copula mixture model defines intra-component dependencies similar to those of a Gaussian mixture, and therefore with the classical meaning of correlation. At the same time, it preserves the standard margins associated with continuous, integer and ordinal features, namely the Gaussian, Poisson and ordered multinomial distributions. As an interesting by-product, the proposed mixture model generalizes many well-known models and also provides visualization tools based on its parameters. In practice, Bayesian inference is adopted and carried out with a Metropolis-within-Gibbs sampler. Experiments on simulated and real data sets illustrate the expected advantages of the proposed model for mixed data: a flexible and meaningful parametrization combined with visualization features.
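To make the construction in the abstract concrete, here is a minimal generative sketch in Python (standard library only). It assumes one continuous, one integer and one ordinal variable per observation, and it only draws data from such a mixture: a latent Gaussian vector with a component-specific correlation matrix is pushed through the standard normal CDF and then through the inverse CDF of each margin. It does not implement the Metropolis-within-Gibbs inference of the paper. All function names and parameter keys (`corr`, `mu`, `sigma`, `lam`, `probs`) are hypothetical and chosen for illustration.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_quantile(u, lam, max_count=10_000):
    """Smallest integer x with P(X <= x) >= u for X ~ Poisson(lam)."""
    x, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u and x < max_count:  # max_count guards against u rounding to 1.0
        x += 1
        pmf *= lam / x
        cdf += pmf
    return x


def ordinal_quantile(u, probs):
    """Index of the first ordered level whose cumulative probability reaches u."""
    cdf = 0.0
    for level, p in enumerate(probs):
        cdf += p
        if u <= cdf:
            return level
    return len(probs) - 1


def sample_component(rng, comp, chol):
    """Draw (continuous, count, ordinal level) from one Gaussian copula component."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # Correlated latent Gaussian vector z = L @ eps, with corr = L @ L^T.
    z = [sum(chol[i][k] * eps[k] for k in range(i + 1)) for i in range(3)]
    # Gaussian margin: the inverse CDF applied to Phi(z) is affine in z.
    continuous = comp["mu"] + comp["sigma"] * z[0]
    count = poisson_quantile(normal_cdf(z[1]), comp["lam"])
    level = ordinal_quantile(normal_cdf(z[2]), comp["probs"])
    return continuous, count, level


def sample_mixture(rng, proportions, components, n):
    """Draw n rows (component label, continuous, count, level) from the mixture."""
    chols = [cholesky(c["corr"]) for c in components]
    rows = []
    for _ in range(n):
        k = rng.choices(range(len(components)), weights=proportions)[0]
        rows.append((k,) + sample_component(rng, components[k], chols[k]))
    return rows


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    comps = [
        {"corr": [[1, 0.6, 0.3], [0.6, 1, 0.2], [0.3, 0.2, 1]],
         "mu": 0.0, "sigma": 1.0, "lam": 2.0, "probs": [0.2, 0.5, 0.3]},
        {"corr": [[1, -0.4, 0.0], [-0.4, 1, 0.5], [0.0, 0.5, 1]],
         "mu": 4.0, "sigma": 0.5, "lam": 8.0, "probs": [0.6, 0.3, 0.1]},
    ]
    for row in sample_mixture(random.Random(0), [0.5, 0.5], comps, 5):
        print(row)
```

In this sketch the dependence inside each component is carried entirely by the latent correlation matrix, while each observed variable keeps its own margin, which is the separation the abstract describes.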
Document type:
Journal article
Communications in Statistics - Theory and Methods, Taylor & Francis, 2016

https://hal.archives-ouvertes.fr/hal-00987760
Contributor: Matthieu Marbac
Submitted on: Tuesday, 20 December 2016 - 09:43:50
Last modified on: Thursday, 22 December 2016 - 10:35:21
Document(s) archived on: Tuesday, 21 March 2017 - 08:55:43

File

article.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00987760, version 4

Citation

Matthieu Marbac, Christophe Biernacki, Vincent Vandewalle. Model-based clustering of Gaussian copulas for mixed data. Communications in Statistics - Theory and Methods, Taylor & Francis, 2016. 〈hal-00987760v4〉
