Conference paper, Year: 2016

Similarity Based Hierarchical Clustering with an Application to Text Collections

Abstract

The Lance-Williams formula is a framework that unifies seven schemes of agglomerative hierarchical clustering. In this paper, we establish a new expression of this formula using cosine similarities instead of distances, and we state the conditions under which the new formula is equivalent to the original one. The interest of our approach is twofold. First, it naturally extends agglomerative hierarchical clustering techniques to kernel functions. Second, reasoning in terms of similarities allows us to design thresholding strategies on proximity values. We therefore propose to sparsify the similarity matrix in order to make these clustering techniques more efficient. We apply our approach to text clustering tasks. Our results show that sparsifying the inner product matrix considerably decreases memory usage and shortens running time while preserving clustering quality.
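For readers unfamiliar with the framework, the following is a minimal Python sketch of the classic distance-based Lance-Williams update, d(k, i∪j) = α_i d(k,i) + α_j d(k,j) + β d(i,j) + γ |d(k,i) − d(k,j)|, shown for four of the seven schemes. It is not the similarity-based formula established in the paper. The function names (`lance_williams_coeffs`, `gram_to_sq_dist`, `sparsify`, `agglomerate`) and the threshold parameter `tau` are hypothetical and chosen only for illustration. Squared Euclidean distances are derived from an inner product (Gram) matrix, and `sparsify` only illustrates the general idea of thresholding similarity values; how the paper combines thresholding with the update is not reproduced here.

```python
import numpy as np

def lance_williams_coeffs(method, ni, nj, nk):
    """Return (alpha_i, alpha_j, beta, gamma) for the distance-based update."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":
        n = ni + nj
        return ni / n, nj / n, 0.0, 0.0
    if method == "ward":  # intended for squared Euclidean distances
        n = ni + nj + nk
        return (ni + nk) / n, (nj + nk) / n, -nk / n, 0.0
    raise ValueError(f"unknown method: {method}")

def gram_to_sq_dist(S):
    """Squared Euclidean distances from a Gram matrix: S_ii + S_jj - 2 S_ij."""
    diag = np.diag(S)
    return diag[:, None] + diag[None, :] - 2.0 * S

def sparsify(S, tau):
    """Illustrative thresholding: zero off-diagonal similarities below tau."""
    out = np.where(S >= tau, S, 0.0)
    np.fill_diagonal(out, np.diag(S))  # always keep self-similarities
    return out

def agglomerate(D, method="average"):
    """Naive O(n^3) agglomerative clustering; returns merges (i, j, height)."""
    D = np.array(D, dtype=float)
    sizes = {i: 1 for i in range(D.shape[0])}
    active = list(range(D.shape[0]))
    merges = []
    while len(active) > 1:
        # Find the closest pair of active clusters.
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                if best is None or D[i, j] < best[2]:
                    best = (i, j, D[i, j])
        i, j, h = best
        merges.append((i, j, float(h)))
        # Lance-Williams update; the merged cluster is stored at index i.
        for k in active:
            if k == i or k == j:
                continue
            ai, aj, beta, gamma = lance_williams_coeffs(
                method, sizes[i], sizes[j], sizes[k])
            new = (ai * D[k, i] + aj * D[k, j] + beta * D[i, j]
                   + gamma * abs(D[k, i] - D[k, j]))
            D[k, i] = D[i, k] = new
        sizes[i] += sizes[j]
        active.remove(j)
    return merges

# Toy data: four points on a line, forming two obvious groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
S = X @ X.T  # inner product (Gram) matrix
merges = agglomerate(gram_to_sq_dist(S), method="single")
print(merges)
```

Each merge is reported as the pair of cluster indices joined and the dissimilarity at which they were joined; the merged cluster keeps the first index of the pair.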
Main file
paper_ida_16.pdf (470.79 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01437124, version 1 (10-04-2017)

License

Copyright (All rights reserved)

Identifiers

HAL Id: hal-01437124
DOI: 10.1007/978-3-319-46349-0_28

Cite

Julien Ah-Pine, Xinyu Wang. Similarity Based Hierarchical Clustering with an Application to Text Collections. Intelligent Data Analysis, Oct 2016, Stockholm, Sweden. pp.320 - 331, ⟨10.1007/978-3-319-46349-0_28⟩. ⟨hal-01437124⟩
185 views
1845 downloads
