Stability of model selection for high-dimensional data

Emeline Perthame (1), Chloé Friguet (2)
(2) LMBA_UBS
LMBA - Laboratoire de Mathématiques de Bretagne Atlantique
Abstract: The analysis of data generated by high-throughput technologies such as DNA microarrays has markedly renewed the statistical methodology for multiple testing and for feature selection in regression and classification problems. Such data are characterized both by their high dimension, as the number of measured features reaches several thousand whereas the sample size is only a few tens, and by their heterogeneity, as the true signal and several confounding factors (uncontrolled and unobserved) are often observed at the same time. In such a framework, the usual statistical approaches are called into question and can lead to misleading decisions. Some recent papers (Efron, 2007; Leek and Storey, 2007 and 2008; Friguet et al., 2009) have focused on the negative impact of data heterogeneity on the consistency of the ranking that results from multiple testing procedures. This presentation aims to show that data heterogeneity also affects the stability of model selection in supervised classification, which is often used to identify relevant subsets of features. The key characteristics of a selection method are both its classification or prediction performance and the reproducibility of the selected variables under perturbation of the data. It is first shown that the subsets selected by well-known procedures such as the LASSO (Tibshirani, 1996) are subject to high variability. The stability of this selection method is compared in a simulation study across several scenarios of dependence between variables: independence, block dependence, factor structure and Toeplitz design (as also considered in Meinshausen and Bühlmann, 2010). The simulations show that most of the usual methods do not select the theoretically best predictors and that good classification performance is achieved only when a large number of variables are selected. As suggested in Friguet et al. (2009), a supervised factor model is proposed to identify a low-dimensional linear kernel that captures the dependence in the data, and new strategies for model selection are deduced from it. This new strategy is finally shown to improve the stability of the usual methods: good classification performance is reached with a smaller number of selected variables, and the theoretically best predictors are selected more often for structures with a high degree of dependence.
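To make the notion of selection stability concrete, the following is a minimal sketch of one way such a simulation could be set up. It is not the authors' protocol. It assumes a Toeplitz covariance with entries rho^|i-j|, a continuous response in place of a class label (a simplification), a plain coordinate-descent LASSO, and the mean pairwise Jaccard index between supports selected on random half-samples as the stability measure. All function names, the penalty `lam` and the problem sizes are illustrative assumptions.

```python
import numpy as np

def toeplitz_design(n, p, rho, rng):
    """Draw n rows from N(0, Sigma) with Sigma[i, j] = rho ** |i - j|."""
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1 / 2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # residual y - X @ beta, with beta = 0 at start
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual that excludes feature j.
            r_j = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(r_j) * max(abs(r_j) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def selection_stability(X, y, lam, n_subsamples=20, frac=0.5, rng=None):
    """Refit the LASSO on random subsamples and compare the selected supports.

    Returns the mean pairwise Jaccard index of the supports and the
    per-variable selection frequency.
    """
    if rng is None:
        rng = np.random.default_rng()
    n, p = X.shape
    m = int(frac * n)
    supports = []
    for _ in range(n_subsamples):
        rows = rng.choice(n, size=m, replace=False)
        Xs = X[rows] - X[rows].mean(axis=0)  # center so no intercept is needed
        ys = y[rows] - y[rows].mean()
        supports.append(set(np.flatnonzero(lasso_cd(Xs, ys, lam)).tolist()))
    scores = []
    for a in range(len(supports)):
        for b in range(a + 1, len(supports)):
            union = supports[a] | supports[b]
            inter = supports[a] & supports[b]
            scores.append(len(inter) / len(union) if union else 1.0)
    freq = np.zeros(p)
    for s in supports:
        freq[list(s)] += 1.0
    return float(np.mean(scores)), freq / n_subsamples

# Illustrative comparison: independent features versus strong Toeplitz dependence.
rng = np.random.default_rng(0)
n, p = 60, 50
beta_true = np.zeros(p)
beta_true[:5] = 2.0  # hypothetical "theoretically best" predictors
for rho in (0.0, 0.9):
    X = toeplitz_design(n, p, rho, rng)
    y = X @ beta_true + rng.normal(size=n)
    jaccard, freq = selection_stability(X, y, lam=0.3, rng=rng)
    print(f"rho={rho}: mean pairwise Jaccard = {jaccard:.2f}, "
          f"mean selection frequency of true predictors = {freq[:5].mean():.2f}")
```

A Jaccard index close to 1 means that nearly the same variables are selected on every subsample, while values near 0 indicate the kind of high variability described above. The block-dependence and factor-structure scenarios could be explored by swapping in a different covariance matrix in `toeplitz_design`.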
Document type:
Conference paper
Statistical Methods for (post)-Genomics Data (SMPGD2013), Jan 2013, Amsterdam, Netherlands

https://hal.archives-ouvertes.fr/hal-00913709
Contributor: Chloé Friguet
Submitted on: Wednesday, 4 December 2013 - 11:59:27
Last modified on: Saturday, 23 September 2017 - 01:11:38

Identifiers

  • HAL Id : hal-00913709, version 1

Citation

Emeline Perthame, Chloé Friguet. Stability of model selection for high-dimensional data. Statistical Methods for (post)-Genomics Data (SMPGD2013), Jan 2013, Amsterdam, Netherlands. 〈hal-00913709〉
