Comparing two clusterings using matchings between clusters of clusters

Frédéric Cazals; Dorian Mazauric; Romain Tetley; Rémi Watrigant

Report (Research Report), Year: 2017

Frédéric Cazals

Role: Author
PersonId : 1189617
ORCID : 0000-0003-2735-6755
IdRef : 094973881

Algorithms, Biology, Structure

Dorian Mazauric

Role: Author
PersonId : 6470
IdHAL : dorian-mazauric
IdRef : 157983455

Algorithms, Biology, Structure

COMUE Université Côte d'Azur (2015-2019)

Romain Tetley

Role: Author

Algorithms, Biology, Structure

Rémi Watrigant

Role: Author
PersonId : 171721
IdHAL : remi-watrigant
ORCID : 0000-0002-6243-5910
IdRef : 199341230

Algorithms, Biology, Structure

COMUE Université Côte d'Azur (2015-2019)

Abstract

Clustering is a fundamental problem in data science, yet the variety of clustering methods and their sensitivity to parameters make clustering hard. To analyze the stability of a given clustering algorithm while varying its parameters, and to compare clusters yielded by different algorithms, several comparison schemes based on matchings, information theory and various indices (Rand, Jaccard) have been developed. We go beyond these by providing a novel class of methods computing meta-clusters within each clustering, where a meta-cluster is a group of clusters, together with a matching between these. Altogether, these pieces of information help assess the coherence between two clusterings. More specifically, let the intersection graph of two clusterings be the edge-weighted bipartite graph in which the nodes represent the clusters, the edges represent the non-empty intersections between two clusters, and the weight of an edge is the number of common items. We introduce the so-called D-family-matching problem on intersection graphs, with D the upper bound on the diameter of the graph induced by the clusters of any meta-cluster. First, we prove NP-completeness results and unbounded approximation ratios for simple strategies. Second, we design exact polynomial-time dynamic programming algorithms for some classes of graphs (in particular trees). Third, we present efficient algorithms, based on spanning trees, for general graphs. In practice, we illustrate the ability of our algorithms to identify relevant meta-clusters between a given clustering and an edited version of it. By comparing our scores against the Variation of Information, we also show the insights yielded by the parameter D.
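The intersection graph defined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `intersection_graph`, the label-list input format, and the toy data are assumptions made for illustration only. Each clustering is given as a list assigning a cluster label to every item, and the result maps each edge (a pair of clusters with a non-empty intersection) to its weight (the number of common items).

```python
from collections import Counter


def intersection_graph(labels_a, labels_b):
    """Return the edge weights of the intersection graph of two clusterings.

    labels_a[i] and labels_b[i] are the clusters of item i in each clustering.
    The result maps (cluster_a, cluster_b) -> number of common items.
    (Hypothetical helper, not taken from the report.)
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    # Each item adds one unit of weight to the edge joining its two clusters,
    # so only pairs with a non-empty intersection appear as keys.
    return Counter(zip(labels_a, labels_b))


# Toy data (assumed): six items clustered in two different ways.
labels_a = ["a1", "a1", "a1", "a2", "a2", "a3"]
labels_b = ["b1", "b1", "b2", "b2", "b2", "b3"]

edges = intersection_graph(labels_a, labels_b)
for (u, v), weight in sorted(edges.items()):
    print(u, v, weight)
```

In this toy example the total edge weight equals the number of items, since every item lies in exactly one cluster of each clustering. The D-family-matching problem described above would then operate on this weighted bipartite graph; the sketch stops at the graph construction.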

Keywords

dynamic programming algorithms; clustering; stability; comparison of clusterings; graph decomposition; NP-completeness

Domains

Data Structures and Algorithms [cs.DS]

Main file

RR-9063-family-matching.pdf (3.33 MB)

Origin: Files produced by the author(s)

Contributor: Dorian Mazauric

https://inria.hal.science/hal-01514872

Submitted on: Friday, September 29, 2017, 08:37:44

Last modified on: Thursday, February 15, 2024, 15:28:00

Long-term archiving on: Saturday, December 30, 2017, 12:44:46

Dates and versions

hal-01514872 , version 1 (26-04-2017)

hal-01514872 , version 2 (29-09-2017)

hal-01514872 , version 3 (01-02-2019)

hal-01514872 , version 4 (16-07-2019)

Identifiers

HAL Id : hal-01514872 , version 2

Cite

Frédéric Cazals, Dorian Mazauric, Romain Tetley, Rémi Watrigant. Comparing two clusterings using matchings between clusters of clusters. [Research Report] RR-9063, INRIA Sophia Antipolis - Méditerranée; Universite Cote d'Azur. 2017. ⟨hal-01514872v2⟩
