Selection of similarity measures for categorical data (Sélection de mesures de similarité pour les données catégorielles)

Guilherme Alves 1 Miguel Couceiro 1 Amedeo Napoli 1
1 ORPAILLEUR - Knowledge representation, reasoning
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: Data clustering is a well-known data mining task that often relies on distances or, in some cases, similarity measures. The latter is the case for real-world datasets that comprise categorical attributes. Several similarity measures have been proposed in the literature; however, the right choice depends on the context and the dataset at hand. In this paper, we address the following question: given a set of measures, which one is best suited for clustering a particular dataset? We propose an approach to automate this choice, and we present an empirical study on categorical datasets in which we evaluate the proposed approach.

https://hal.archives-ouvertes.fr/hal-02410221
Contributor: Guilherme Alves
Submitted on: Friday, December 13, 2019 - 5:23:17 PM
Last modification on: Wednesday, January 8, 2020 - 1:36:30 PM

File

ga-etal-egcf-2020.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-02410221, version 1

Citation

Guilherme Alves, Miguel Couceiro, Amedeo Napoli. Sélection de mesures de similarité pour les données catégorielles. 20th edition of the conference Extraction et Gestion des Connaissances (EGC), Jan 2020, Brussels, Belgium. ⟨hal-02410221⟩
