Slimming down a high-dimensional binary datatable: relevant eigen-subspace and substantial content

Alain Lelu

Conference paper, Year: 2010


Knowledge Information and Web Intelligence

Laboratoire de Semio-Linguistique, Didactique et Informatique

Abstract

Determining the number of relevant dimensions in the eigen-space of a data matrix is a central issue in many data-mining applications. We tackle the sub-problem of finding the "right" dimensionality of a type of data matrix often encountered in text or usage mining: large, sparse, high-dimensional binary datatables. We apply a randomization test to this problem. We validate our approach first on artificial datasets, then on a real documentary data collection of 1900 documents described in a 3600-keyword dataspace, where the actual, intrinsic dimension appears to be 28 times smaller than the number of keywords, which is important information when preparing to cluster or discriminate such data. We also present preliminary results on the problem of clearing the datatable of non-essential information bits.
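The abstract does not spell out the randomization scheme, so the following is only a generic sketch of the idea, not the paper's procedure. It assumes a null model in which each column of the binary table is shuffled independently (keeping keyword frequencies but breaking associations between keywords), and it counts how many leading singular values of the column-centred table exceed the corresponding null quantile. The function names, the planted test table, and the parameters `n_perm` and `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def singular_values(table):
    # Column-centre the table, then return its singular values (descending).
    centred = table - table.mean(axis=0)
    return np.linalg.svd(centred, compute_uv=False)

def estimate_dimension(table, n_perm=50, alpha=0.01, seed=0):
    # Assumed null model: shuffle every column independently, which keeps
    # each column sum but destroys any association between columns.
    rng = np.random.default_rng(seed)
    observed = singular_values(table)
    null = np.array([
        singular_values(rng.permuted(table, axis=0)) for _ in range(n_perm)
    ])
    # Per-rank threshold: the (1 - alpha) quantile of the null values.
    threshold = np.quantile(null, 1 - alpha, axis=0)
    # Count the leading run of singular values above their threshold.
    dim = 0
    for is_above in observed > threshold:
        if not is_above:
            break
        dim += 1
    return dim

def planted_table(seed=1):
    # Hypothetical artificial dataset: 150 "documents" x 60 "keywords",
    # sparse background noise plus three dense diagonal blocks.
    rng = np.random.default_rng(seed)
    table = (rng.random((150, 60)) < 0.02).astype(float)
    for c in range(3):
        rows = slice(50 * c, 50 * (c + 1))
        cols = slice(20 * c, 20 * (c + 1))
        table[rows, cols] = rng.random((50, 20)) < 0.5
    return table

dim = estimate_dimension(planted_table())
print("estimated number of relevant dimensions:", dim)
```

A column shuffle is only one possible null model; a scheme that also preserves row sums would be stricter and may give a different count on real documentary data.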

Keywords

randomization test; dimensionality reduction; data reconstitution; power-law distribution

Domains

Methodology [stat.ME]

Main file

lelu_COMPSTAT2010ccccc.pdf (218.98 KB)

Origin: Publisher files allowed on an open archive


https://hal.science/hal-00462464

Submitted on: Thursday, 24 November 2011, 09:49:22

Last modified on: Thursday, 13 April 2023, 09:26:12

Long-term archiving on: Friday, 16 November 2012, 11:55:10

Dates and versions

hal-00462464, version 1 (24-11-2011)

Identifiers

HAL Id: hal-00462464, version 1

Cite

Alain Lelu. Slimming down a high-dimensional binary datatable: relevant eigen-subspace and substantial content. 19th International Conference on Computational Statistics - COMPSTAT 2010, Aug 2010, Paris, France. pp.1271-1278. ⟨hal-00462464⟩


Collections

CNRS INRIA UNIV-FCOMTE UNIV-LORRAINE LORIA ELLIADD
