Variable clustering in high dimensional linear regression models

Loïc Yengo 1, 2, 3 Julien Jacques 2, 3 Christophe Biernacki 2, 3
2 MODAL - MOdel for Data Analysis and Learning
Inria Lille - Nord Europe, LPP - Laboratoire Paul Painlevé - UMR 8524, CERIM - Santé publique : épidémiologie et qualité des soins-EA 2694, Polytech Lille, Université de Lille 1, IUT’A
Abstract: Over the last three decades, the advent of technologies for massive data collection has brought deep changes to many scientific fields. What was first seen as a blessing soon came to be known as the curse of dimensionality. Reducing dimensionality has therefore become a challenge in statistical learning. In high-dimensional linear regression models, the quest for parsimony has long been driven by the idea that a few relevant variables may be sufficient to describe the modeled phenomenon. Recently, a new paradigm was introduced in a series of articles from which the present work derives. We propose here a model that simultaneously performs variable clustering and regression. Our approach no longer treats the regression coefficients as fixed parameters to be estimated, but as unobserved random variables following a Gaussian mixture model. The latent partition is then determined by maximum likelihood, and predictions are obtained from the conditional distribution of the regression coefficients given the data. The number of latent components is chosen using the BIC criterion. Our model has very competitive predictive performance compared to standard approaches and brings significant improvements in interpretability.
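To make the modeling idea in the abstract concrete, the following is a minimal sketch of the generative side only: regression coefficients are drawn from a Gaussian mixture, so that variables sharing a component have similar effects. It is not the authors' estimation procedure. The function name, the parameter names (`means`, `weights`, `gamma`, `sigma`), the common within-component standard deviation, and the standard Gaussian design matrix are all assumptions made for illustration.

```python
import random


def simulate_clustered_regression(n, p, means, weights, gamma, sigma,
                                  intercept=0.0, seed=0):
    """Draw one data set from a linear model whose coefficients follow
    a Gaussian mixture (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    components = list(range(len(means)))
    # Latent partition: each variable is assigned to one mixture component.
    z = rng.choices(components, weights=weights, k=p)
    # Each coefficient is Gaussian around the mean of its component
    # (a shared standard deviation gamma is an assumption of this sketch).
    beta = [rng.gauss(means[k], gamma) for k in z]
    # Standard Gaussian design matrix, chosen only for illustration.
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Linear response with additive Gaussian noise.
    y = [intercept + sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma)
         for row in X]
    return X, y, beta, z


# High-dimensional setting: more variables (p = 200) than observations (n = 50).
means = [0.0, 2.0, -3.0]
X, y, beta, z = simulate_clustered_regression(
    n=50, p=200, means=means, weights=[0.8, 0.1, 0.1], gamma=0.1, sigma=1.0)
```

In this sketch the 200 coefficients are summarized by three component means, their weights, and two variances, which is the kind of parsimony the abstract refers to. Recovering the partition `z` from `X` and `y` alone is the estimation problem addressed by the paper and is not attempted here.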
Document type:
Journal article
Journal de la Société Française de Statistique, Société Française de Statistique et Société Mathématique de France, 2014, 155 (2), pp.19

Cited literature [25 references]

https://hal.archives-ouvertes.fr/hal-00764927
Contributor: Julien Jacques
Submitted on: Friday, 2 August 2013 - 13:50:09
Last modified on: Tuesday, 13 December 2016 - 15:45:50
Document(s) archived on: Monday, 4 November 2013 - 17:21:39

File

CLERE.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00764927, version 2

Citation

Loïc Yengo, Julien Jacques, Christophe Biernacki. Variable clustering in high dimensional linear regression models. Journal de la Société Française de Statistique, Société Française de Statistique et Société Mathématique de France, 2014, 155 (2), pp.19. 〈hal-00764927v2〉

Metrics

Record views: 575
Document downloads: 402