Conference paper, Year: 2010

Slimming down a high-dimensional binary datatable: relevant eigen-subspace and substantial content

Abstract

Determining the number of relevant dimensions in the eigen-space of a data matrix is a central issue in many data-mining applications. We tackle here the sub-problem of finding the "right" dimensionality of a type of data matrix often encountered in text or usage mining: large, sparse, high-dimensional binary datatables. We apply a randomization test to this problem. We validate our approach first on artificial datasets, then on a real document collection of 1900 documents described in a 3600-keyword dataspace, where the actual, intrinsic dimension appears to be 28 times smaller than the number of keywords, which is important information when preparing to cluster or discriminate such data. We also present preliminary results on the problem of clearing the datatable of non-essential information bits.
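To illustrate the general idea of a randomization test for eigen-space dimensionality, here is a minimal Python sketch. It is not the paper's procedure: the abstract does not describe the randomization scheme, so the null model below (shuffling each column independently, which preserves column totals only), the 99% quantile threshold, the function names, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np


def singular_values(x):
    """Singular values of the column-centered matrix, largest first."""
    centered = x - x.mean(axis=0)
    return np.linalg.svd(centered, compute_uv=False)


def permute_columns(x, rng):
    """Assumed null model: shuffle each column independently.

    Column totals are preserved; row totals and any association
    between columns are destroyed.
    """
    return np.column_stack(
        [rng.permutation(x[:, j]) for j in range(x.shape[1])]
    )


def relevant_dimensions(x, n_rand=50, quantile=0.99, seed=0):
    """Count the leading singular values that exceed the null quantile."""
    rng = np.random.default_rng(seed)
    observed = singular_values(x)
    null = np.array(
        [singular_values(permute_columns(x, rng)) for _ in range(n_rand)]
    )
    # Rank-wise threshold: the chosen quantile of the null singular values.
    threshold = np.quantile(null, quantile, axis=0)
    k = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break  # stop at the first dimension indistinguishable from noise
        k += 1
    return k


# Toy binary table (hypothetical): 3 groups of 50 rows, each group
# favouring its own block of 10 columns out of 30.
data_rng = np.random.default_rng(1)
groups = np.repeat(np.arange(3), 50)
prob = np.full((150, 30), 0.05)
for g in range(3):
    prob[groups == g, 10 * g:10 * (g + 1)] = 0.6
x = (data_rng.random((150, 30)) < prob).astype(int)

k = relevant_dimensions(x)
print("estimated number of relevant dimensions:", k)
```

Stopping at the first singular value that does not beat its null threshold is a deliberately conservative design choice. A null model that also preserves row totals would be a stricter alternative for sparse binary tables.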
Main file
lelu_COMPSTAT2010ccccc.pdf (218.98 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-00462464, version 1 (24-11-2011)

Identifiers

  • HAL Id: hal-00462464, version 1

Cite

Alain Lelu. Slimming down a high-dimensional binary datatable: relevant eigen-subspace and substantial content. 19th International Conference on Computational Statistics - COMPSTAT 2010, Aug 2010, Paris, France. pp.1271-1278. ⟨hal-00462464⟩
