VARCLUST: clustering variables using dimensionality reduction

Piotr Sobczyk; Stanislaw Wilczynski; Malgorzata Bogdan; Piotr Graczyk; Julie Josse; Fabien Panloup; Valerie Seegers; Mateusz Staniak

Preprint, Working Paper. Year: 2020



Piotr Sobczyk

Role: Author
PersonId : 1081444

Institute of Mathematics [Wrocław]

Stanislaw Wilczynski

Role: Author
PersonId : 1081445

Microsoft Development Center Norway [Oslo]

Malgorzata Bogdan

Role: Author

Institute of Mathematics [University of Wroclaw]

Piotr Graczyk

Role: Author
PersonId : 904760

Laboratoire Angevin de Recherche en Mathématiques

Julie Josse

Role: Author
PersonId : 993919

École polytechnique

Inria Sophia Antipolis - Méditerranée

Fabien Panloup

Role: Author
PersonId : 1021764

Laboratoire Angevin de Recherche en Mathématiques

Institut de Cancérologie de l'Ouest [Angers/Nantes]

Valerie Seegers

Role: Author
PersonId : 1081446

Institut de Cancérologie de l'Ouest [Angers/Nantes]

Mateusz Staniak

Role: Author
PersonId : 1081447

Institute of Mathematics [University of Wroclaw]

Abstract

The VARCLUST algorithm is proposed for clustering variables under the assumption that the variables in a given cluster are linear combinations of a small number of hidden latent variables, corrupted by random noise. The entire clustering task is viewed as a statistical model selection problem, where the model is defined by the number of clusters, the partition of variables into these clusters and the 'cluster dimensions', i.e. the vector of dimensions of the linear subspaces spanning each of the clusters. The "optimal" model is selected using an approximate Bayesian criterion based on Laplace approximations and a non-informative uniform prior on the number of clusters. To search over the huge space of possible models, we propose an extension of the ClustOfVar algorithm of [29, 7], which was dedicated to subspaces of dimension 1 only and which is similar in structure to the K-centroid algorithm. We provide a complete methodology with theoretical guarantees, extensive numerical experiments, complete data analyses and an implementation. Our algorithm assigns variables to appropriate clusters based on the consistent Bayesian Information Criterion (BIC), and estimates the dimensionality of each cluster by the PEnalized SEmi-integrated Likelihood Criterion (PESEL) of [24], whose consistency we prove. Additionally, we prove that each iteration of our algorithm leads to an increase of the Laplace approximation to the model posterior probability, and we provide a criterion for estimating the number of clusters. Numerical comparisons with other algorithms show that VARCLUST may outperform some popular machine learning tools for sparse subspace clustering. We also report the results of real data analyses, including TCGA breast cancer data and meteorological data, which show that the algorithm can lead to meaningful clustering. The proposed method is implemented in the publicly available R package varclust.
Keywords: variable clustering · Bayesian approach · k-means · dimensionality reduction · subspace clustering
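The alternating structure described in the abstract (fit a low-dimensional subspace to each cluster, then reassign each variable to the subspace that explains it best) can be illustrated with a minimal Python sketch. This is not the varclust implementation, and it makes simplifying assumptions that are not in the paper: the dimension `dim` is fixed and shared by all clusters instead of being estimated by PESEL, and variables are assigned by the squared norm of their projection rather than by BIC. The function names `subspace_basis` and `cluster_variables` are hypothetical.

```python
import numpy as np


def subspace_basis(block, dim):
    """Orthonormal basis of the leading principal directions of a block of variables."""
    # Left singular vectors span the column space of the block.
    u, _, _ = np.linalg.svd(block, full_matrices=False)
    return u[:, :dim]


def cluster_variables(X, n_clusters, dim, n_iter=20, seed=0):
    """K-means-style clustering of the columns (variables) of X.

    Assumptions of this sketch (not taken from the paper): one fixed
    dimension `dim` for every cluster, assignment by squared projection
    norm, and no constant columns in X.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Center each variable and scale it to unit norm.
    Z = X - X.mean(axis=0)
    Z = Z / np.linalg.norm(Z, axis=0)
    # Random initial partition of the variables.
    labels = rng.integers(n_clusters, size=p)
    for _ in range(n_iter):
        fit = np.zeros((n_clusters, p))
        for k in range(n_clusters):
            members = Z[:, labels == k]
            if members.shape[1] == 0:
                # Re-seed an empty cluster with one randomly chosen variable.
                members = Z[:, [rng.integers(p)]]
            B = subspace_basis(members, dim)
            # Squared norm of each variable's projection onto the subspace.
            fit[k] = np.sum((B.T @ Z) ** 2, axis=0)
        # Move each variable to the subspace that captures most of it.
        new_labels = fit.argmax(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Calling `cluster_variables(X, n_clusters=3, dim=2)` on an n-by-p data matrix returns one integer label per variable. As with any K-means-type scheme, the result depends on the random initialization, so in practice several seeds would be tried.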

Domains

Statistics [math.ST], Probability [math.PR]

Main file

varclust_submitted.pdf (3.22 MB)

Origin: Files produced by the author(s)

Contributor: Fabien Panloup

https://hal.science/hal-03002017

Submitted on: Thursday, 12 November 2020, 16:10:30

Last modified on: Wednesday, 3 April 2024, 13:04:02

Long-term archiving on: Saturday, 13 February 2021, 19:59:15

Dates and versions

hal-03002017, version 1 (12-11-2020)

hal-03002017, version 2 (18-12-2020)

Identifiers

HAL Id: hal-03002017, version 1

Cite

Piotr Sobczyk, Stanislaw Wilczynski, Malgorzata Bogdan, Piotr Graczyk, Julie Josse, et al. VARCLUST: clustering variables using dimensionality reduction. 2020. ⟨hal-03002017v1⟩


