Stability of model selection for high-dimensional data

Emeline Perthame; Chloé Friguet

Conference paper, Year: 2013


Emeline Perthame

Role: Author
PersonId : 185133
IdHAL : emeline-perthame
ORCID : 0000-0002-8266-5908
IdRef : 193507846

Institut de Recherche Mathématique de Rennes

Chloé Friguet

Role: Author
PersonId : 183883
IdHAL : chloefriguet
ORCID : 0000-0003-2827-0283
IdRef : 148745504

Laboratoire de Mathématiques de Bretagne Atlantique

Abstract

The analysis of data generated by high-throughput technologies such as DNA microarrays has markedly renewed the statistical methodology for multiple testing and for feature selection in regression or classification problems. Such data are characterized both by their high dimension, as the number of measured features reaches several thousand whereas the sample size is only a few dozen, and by their heterogeneity, as the true signal and several confounding factors (uncontrolled and unobserved) are often present at the same time. In such a framework, the usual statistical approaches are called into question and can lead to misleading decisions. Some recent papers (Efron, 2007; Leek and Storey, 2007 and 2008; Friguet et al., 2009) have focused on the negative impact of data heterogeneity on the consistency of the ranking that results from multiple testing procedures. This presentation aims to show that data heterogeneity also affects the stability of model selection in supervised classification, which is often used to identify relevant subsets of features. The key characteristics of a selection method are both its classification or prediction performance and the reproducibility of the selected variables under perturbation of the data. It is first shown that the subsets selected by well-known procedures such as the LASSO (Tibshirani, 1996) are subject to high variability. The stability of this selection method is assessed through a simulation study that considers several scenarios of dependence between variables: independence, block dependence, factor structure and Toeplitz design (as also considered in Meinshausen and Bühlmann, 2010). The simulations show that most usual methods do not select the theoretically best predictors and that good classification performance is achieved only when a large number of variables are selected. As suggested in Friguet et al. (2009), a supervised factor model is proposed to identify a low-dimensional linear kernel that captures data dependence, and new strategies for model selection are deduced from it. This new strategy is finally shown to improve the stability of the usual methods: good classification performance is reached with a smaller number of selected variables, and the theoretically best predictors are selected more often for structures with a high degree of dependence.
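As an illustration of the kind of simulation described in the abstract, the sketch below builds the four dependence designs and measures how often a LASSO fit selects each variable across random half-samples. It is not the authors' code. All function names, the dimensions, the correlation values, the penalty `lam`, the choice of the first three variables as true predictors, and the use of a least-squares LASSO on labels coded as -1/+1 are assumptions made for the example.

```python
import numpy as np


def toeplitz_cov(p, rho):
    """Toeplitz correlation: corr(X_i, X_j) = rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def block_cov(p, block_size, rho):
    """Equicorrelated blocks; variables in different blocks are independent."""
    block_id = np.arange(p) // block_size
    cov = np.where(block_id[:, None] == block_id[None, :], rho, 0.0)
    np.fill_diagonal(cov, 1.0)
    return cov


def factor_cov(p, n_factors, rng):
    """Correlation implied by X = B z + e with random loadings B (assumed)."""
    loadings = rng.normal(size=(p, n_factors))
    cov = loadings @ loadings.T + np.eye(p)
    scale = np.sqrt(np.diag(cov))
    return cov / np.outer(scale, scale)


def simulate(cov, n, n_signal, rng):
    """Gaussian features with the given correlation; labels in {-1, +1}
    driven by the first n_signal variables (an assumption of this sketch)."""
    p = cov.shape[0]
    X = rng.normal(size=(n, p)) @ np.linalg.cholesky(cov).T
    score = X[:, :n_signal].sum(axis=1) + rng.normal(size=n)
    return X, np.where(score > 0, 1.0, -1.0)


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]          # remove variable j from the fit
            rho_j = X[:, j] @ resid / n
            beta[j] = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]          # put the updated variable back
    return beta


def selection_frequencies(X, y, lam, n_subsamples=20, rng=None):
    """Fraction of random half-samples in which each variable is selected."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs = X[idx] - X[idx].mean(axis=0)
        ys = y[idx] - y[idx].mean()
        counts += lasso_cd(Xs, ys, lam) != 0
    return counts / n_subsamples


rng = np.random.default_rng(0)
p, n, n_signal = 30, 60, 3
designs = {
    "independence": np.eye(p),
    "block": block_cov(p, block_size=5, rho=0.7),
    "factor": factor_cov(p, n_factors=2, rng=rng),
    "toeplitz": toeplitz_cov(p, rho=0.9),
}
results = {}
for name, cov in designs.items():
    X, y = simulate(cov, n, n_signal, rng)
    results[name] = selection_frequencies(X, y, lam=0.1, rng=rng)
    # Selection frequency of the assumed true predictors under each design.
    print(name, np.round(results[name][:n_signal], 2))
```

A selection frequency close to 1 for the true predictors and close to 0 elsewhere would indicate a stable selection. Comparing `results` across the four designs gives a rough view of how dependence between variables changes that picture; the supervised factor model proposed by the authors is not implemented here.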

Keywords

variable selection; high-dimension; stability; factor model

Domains

Applications [stat.AP]; Methodology [stat.ME]


https://hal.science/hal-00913709

Submitted on: Wednesday, 4 December 2013, 11:59:27

Last modified on: Monday, 11 March 2024, 14:40:18

Dates and versions

hal-00913709 , version 1 (04-12-2013)

Identifiers

HAL Id : hal-00913709 , version 1

Cite

Emeline Perthame, Chloé Friguet. Stability of model selection for high-dimensional data. Statistical Methods for (post)-Genomics Data (SMPGD2013), Jan 2013, Amsterdam, Netherlands. ⟨hal-00913709⟩


Collections

UNIV-BREST UNIV-RENNES1 IRMAR UR2-HB CNRS INSA-RENNES UNAM UBS UR1-MATH-STIC UNIV-RENNES2 UNIV-RENNES INSA-GROUPE IBNM UR1-MATH-NUM
