Stability of feature selection in classification issues for high-dimensional correlated data

Whether to account for dependence in feature selection remains an open question in supervised classification problems where the number of covariates exceeds the number of observations. Some recent papers surprisingly show the superiority of naive Bayes approaches, which rest on an obviously erroneous assumption of independence, whereas others recommend inferring the dependence structure in order to decorrelate the selection statistics. In the classical linear discriminant analysis (LDA) framework, the present paper first highlights the impact of dependence in terms of instability of feature selection. A second objective is to revisit this issue using a flexible factor model for the covariance. This framework introduces latent components of dependence, conditionally on which a new Bayes consistency is defined. A procedure is then proposed for the joint estimation of the expectation and variance parameters of the model. The proposed method is compared with recent regularized diagonal discriminant analysis approaches, which assume independence among features, and with regularized LDA procedures, in terms of both classification performance and stability of feature selection. It is implemented in the R package FADA, freely available from the CRAN repository.
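As a rough illustration of the instability mentioned in the abstract, the following Python sketch simulates two-class Gaussian data and counts how many null features pass a fixed t-statistic threshold, with and without a single latent factor shared by all features. It is not the procedure of the paper and does not use the FADA package: the one-factor model, the loading of 0.9, the threshold |t| > 2, the sample sizes, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def two_sample_t(x, y):
    """Welch-type t-statistic of each column of x between classes y == 1 and y == 0."""
    x0, x1 = x[y == 0], x[y == 1]
    se = np.sqrt(x0.var(axis=0, ddof=1) / len(x0) + x1.var(axis=0, ddof=1) / len(x1))
    return (x1.mean(axis=0) - x0.mean(axis=0)) / se


def false_selection_counts(loading, n_rep=200, n=40, p=500, n_signal=20,
                           effect=1.0, threshold=2.0, seed=0):
    """Number of null features with |t| > threshold in each simulated data set.

    Assumed model (illustrative): x = y * mu + z * b + e, where z is one latent
    factor per observation, b holds loadings of size `loading` with random signs,
    and e is independent noise scaled so that every feature has unit variance.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)          # balanced class labels
    mu = np.zeros(p)
    mu[:n_signal] = effect                 # only the first n_signal features differ
    b = loading * rng.choice([-1.0, 1.0], size=p)
    counts = np.empty(n_rep, dtype=int)
    for r in range(n_rep):
        z = rng.standard_normal((n, 1))    # latent component of dependence
        e = np.sqrt(1.0 - loading**2) * rng.standard_normal((n, p))
        x = y[:, None] * mu + z * b + e
        t = two_sample_t(x, y)
        counts[r] = int(np.sum(np.abs(t[n_signal:]) > threshold))
    return counts


ind = false_selection_counts(loading=0.0)  # independent features
dep = false_selection_counts(loading=0.9)  # strongly correlated features
print("independent: mean %.1f, sd %.1f" % (ind.mean(), ind.std(ddof=1)))
print("one-factor:  mean %.1f, sd %.1f" % (dep.mean(), dep.std(ddof=1)))
```

Both settings give every feature the same marginal variance, so any difference in the spread of the counts comes from the shared factor alone. In this setup the number of falsely selected features should vary far more from one data set to the next under the one-factor model than under independence.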

Keywords

Stability; Classification; Discriminant analysis; Variable selection; High dimension

Domains

Statistics [stat]; Applications [stat.AP]; Methodology [stat.ME]

Main file

Stability of feature selection in classification issues.pdf (663.92 KB)

Origin: Publication funded by an institution

Contributor: Chloé Friguet

https://hal.science/hal-01256508

Submitted on: Thursday, June 4, 2020, 17:43:54

Last modified on: Tuesday, December 19, 2023, 09:47:26

Long-term archiving on: Thursday, December 3, 2020, 14:20:28

Dates and versions

hal-01256508, version 1 (June 4, 2020)

License

Attribution

Identifiers

HAL Id: hal-01256508, version 1
DOI: 10.1007/s11222-015-9569-2

Cite

Emeline Perthame, Chloé Friguet, David Causeur. Stability of feature selection in classification issues for high-dimensional correlated data. Statistics and Computing, 2016, 26 (4), pp.783-796. ⟨10.1007/s11222-015-9569-2⟩. ⟨hal-01256508⟩

Collections

INSTITUT-TELECOM UNIV-RENNES1 IRMAR UR2-HB CNRS INRIA INSA-RENNES IRISA UNAM IRMAR-STAT UBS IRISA_UBS CHL CENTRALESUPELEC IRISA-D5 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES2 UNIV-RENNES INSA-GROUPE ANR UR1-MATH-NUM INSTITUT-AGRO-RENNES-ANGERS-UMR-IRMAR INSTITUT-AGRO-RENNES-ANGERS
