NbClust package: finding the relevant number of clusters in a dataset

Malika Charrad; Nadia Ghazzali; Véronique Boiteau; Azam Niknafs

Conference paper, year: 2012

Malika Charrad

Role: Author
PersonId: 1343595
IdHAL: malika-charrad
ORCID: 0009-0004-0360-5926

Centre d'études et de recherche en informatique et communications

Nadia Ghazzali

Role: Author
PersonId: 964680

Véronique Boiteau

Role: Author

Azam Niknafs

Role: Author

Abstract

Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than to objects in different groups. Most clustering algorithms rely on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some evaluation of its validity. In general terms, there are three approaches to investigating cluster validity. The first is based on external criteria, which compare the results of a cluster analysis with externally known results, such as externally provided class labels. The second is based on internal criteria, which use information obtained from within the clustering process to evaluate how well the results fit the data, without reference to external information. The third is based on relative criteria. Here the basic idea is to evaluate a clustering structure by comparing it with other clustering schemes produced by the same algorithm but with different parameter values, e.g. the number of clusters. In the literature, a wide variety of indices have been proposed for finding the optimal number of clusters in a partitioning of a data set during the clustering process. Although a vast number of references exist, few comparative studies of these indices have been performed (Milligan and Cooper, 1985). Moreover, for most of the indices proposed in the literature, no programs are available to test and compare them. The R package NbClust has been developed specifically for that purpose. It implements 30 cluster validation indices that can be applied directly to the output of the clustering algorithms provided in the same package, namely hierarchical clustering and k-means. Most of these indices are described in the Milligan and Cooper study (Milligan and Cooper, 1985).
The NbClust function applies either a single index or all 30 indices simultaneously and proposes to the user the best clustering scheme among the results obtained by varying all combinations of the number of clusters, the distance measure ("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), and the clustering method ("ward", "single", "complete", "average", "mcquitty", "median", "centroid").
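The relative-criteria idea described above can be illustrated with a small sketch. It is written in Python for illustration only and is not the package's R code: the function names (`agglomerate`, `calinski_harabasz`, `best_number_of_clusters`), the toy data, and the restriction to three distances and three linkage methods are all assumptions made here. It uses a single index, the Calinski-Harabasz criterion, which is among those examined by Milligan and Cooper (1985), and keeps the number of clusters that maximizes it for each combination of distance and method.

```python
import itertools
import math

# Distance measures: a subset of those named in the abstract (assumed definitions).
DISTANCES = {
    "euclidean": lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "maximum": lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
}

# Linkage rules: reduce the list of pairwise point distances to one cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(points, k, distance="euclidean", method="complete"):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    dist, link = DISTANCES[distance], LINKAGES[method]
    clusters = [[i] for i in range(len(points))]  # each cluster is a list of point indices
    while len(clusters) > k:
        a, b = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda pair: link([dist(points[i], points[j])
                                   for i in clusters[pair[0]]
                                   for j in clusters[pair[1]]]),
        )
        clusters[a] += clusters.pop(b)  # a < b, so index a is still valid after the pop
    return clusters


def centroid(points, indices):
    """Coordinate-wise mean of the selected points."""
    return [sum(col) / len(indices) for col in zip(*(points[i] for i in indices))]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, clusters):
    """CH = [B / (k - 1)] / [W / (n - k)], B and W being between/within sums of squares."""
    n, k = len(points), len(clusters)
    grand = centroid(points, range(n))
    within = between = 0.0
    for indices in clusters:
        centre = centroid(points, indices)
        within += sum(squared_distance(points[i], centre) for i in indices)
        between += len(indices) * squared_distance(centre, grand)
    return (between / (k - 1)) / (within / (n - k))


def best_number_of_clusters(points, k_range, distance="euclidean", method="complete"):
    """Score every candidate k with the index and return the maximizer and all scores."""
    scores = {k: calinski_harabasz(points, agglomerate(points, k, distance, method))
              for k in k_range}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]

results = {}
for distance, method in itertools.product(DISTANCES, LINKAGES):
    results[(distance, method)] = best_number_of_clusters(points, range(2, 6), distance, method)
    print(distance, method, "-> best k:", results[(distance, method)][0])
```

On this toy data every combination selects three clusters. A single index is used here to keep the sketch short; comparing the choices made by several indices would follow the same loop with one scoring function per index.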

Keywords

Number of clusters; Validity indices; Cluster validity; KMeans; Hierarchical clustering

Domains

Computer Science [cs]; Statistics [math.ST]

https://hal.science/hal-01126149

Submitted on: Friday, 6 March 2015, 11:43:29

Last modified on: Monday, 22 April 2024, 11:23:20

Dates and versions

hal-01126149, version 1 (06-03-2015)

Identifiers

HAL Id: hal-01126149, version 1

Cite

Malika Charrad, Nadia Ghazzali, Véronique Boiteau, Azam Niknafs. NbClust package: finding the relevant number of clusters in a dataset. UseR! 2012, Jun 2012, Nashville, United States. ⟨hal-01126149⟩

Collections

CNAM CEDRIC-CNAM HESAM
