Combining clustering of variables and feature selection using random forests: the CoV/VSURF procedure

Marie Chavent 1 Robin Genuer 2, * Jerome Saracco 1
* Corresponding author
1 CQFD - Quality control and dynamic reliability
IMB - Institut de Mathématiques de Bordeaux, Inria Bordeaux - Sud-Ouest
2 SISTM - Statistics In System biology and Translational Medicine
Epidémiologie et Biostatistique [Bordeaux], Inria Bordeaux - Sud-Ouest
Abstract: High-dimensional data classification is a challenging problem. A standard approach to this problem is variable selection, e.g. using stepwise or LASSO procedures. Another standard approach is dimension reduction, e.g. by Principal Component Analysis or Partial Least Squares procedures. The approach proposed in this paper combines dimension reduction and variable selection. First, a clustering of variables procedure is used to build groups of correlated variables in order to reduce the redundancy of information. This dimension reduction step relies on the R package ClustOfVar, which can deal with both numerical and categorical variables. Second, the most relevant synthetic variables (numerical variables summarizing the groups obtained in the first step) are selected with a variable selection procedure using random forests, implemented in the R package VSURF. The numerical performance of the proposed methodology, called CoV/VSURF, is compared with direct applications of VSURF or random forests to the original $p$ variables. Improvements obtained with the CoV/VSURF procedure are illustrated on two simulated mixed datasets (cases $n>p$ and $n<
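As a rough illustration of the first (dimension reduction) step only, the sketch below groups correlated numerical variables and summarizes each group by one synthetic variable. It is not the ClustOfVar algorithm: the greedy threshold grouping, the function names `group_correlated` and `synthetic_variable`, the threshold of 0.5 and the simulated data are all assumptions made for this example, and categorical variables and the VSURF selection step are not covered.

```python
import numpy as np

def group_correlated(X, threshold=0.5):
    """Greedy grouping (a stand-in for hierarchical clustering of variables).

    A column joins the first group whose seed column has a squared
    correlation with it of at least `threshold`; otherwise it seeds a new group.
    """
    r2 = np.corrcoef(X, rowvar=False) ** 2
    groups = []
    for j in range(X.shape[1]):
        for g in groups:
            if r2[j, g[0]] >= threshold:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups

def synthetic_variable(X, cols):
    """First principal component scores of the standardized columns `cols`.

    Assumes no column is constant (non-zero standard deviation).
    """
    Z = X[:, cols]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0] * s[0]

# Hypothetical data: two latent signals observed through 3 and 2 noisy columns.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))
X = np.column_stack([
    latent[:, 0] + 0.1 * rng.normal(size=n),
    latent[:, 0] + 0.1 * rng.normal(size=n),
    latent[:, 0] + 0.1 * rng.normal(size=n),
    latent[:, 1] + 0.1 * rng.normal(size=n),
    latent[:, 1] + 0.1 * rng.normal(size=n),
])

groups = group_correlated(X)
# One synthetic variable per group; these would then feed a selection step.
S = np.column_stack([synthetic_variable(X, g) for g in groups])
```

Here the five redundant columns are reduced to one synthetic column per group, and a variable selection method would then be applied to `S` rather than to `X`.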
Document type:
Preprint, Working paper
2016

https://hal.archives-ouvertes.fr/hal-01345840
Contributor: Robin Genuer
Submitted on: Tuesday, 23 August 2016 - 16:40:29
Last modified on: Monday, 29 August 2016 - 13:09:39

Files

version_arxiv.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01345840, version 2
  • arXiv: 1608.06740

Citation

Marie Chavent, Robin Genuer, Jerome Saracco. Combining clustering of variables and feature selection using random forests: the CoV/VSURF procedure. 2016. 〈hal-01345840v2〉
