Book sections

Importance of Dataspace Embeddings when Evaluating Text Clustering Methods

Alain Lelu 1, Martine Cadot 2
2 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : A fair evaluation of text clustering methods must clarify the relations between 1) pre-processing, which yields raw term occurrence vectors, 2) data transformation, and 3) the method in the strict sense. We empirically compared a dozen well-known methods and variants in a protocol crossing three contrasting open-access corpora in a few dozen transformed dataspaces. We compared the resulting clusterings to their supposed "ground-truth" classes by means of four usual indices. The results both confirm well-established implicit combinations and show good performance from unexpected ones, mostly in spectral or kernel dataspaces. The rich material resulting from these roughly 600 runs includes a wealth of intriguing facts, which call for further research on the specificities of text corpora in relation to methods and dataspaces.
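To make the evaluation step concrete, here is a minimal sketch of how a clustering can be scored against "ground-truth" classes. The abstract does not name its four indices, so the two used here (adjusted Rand index and purity) are assumptions chosen as common examples, and the toy labels are invented for illustration only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two items to count pairs")
    cells = Counter(zip(truth, pred))   # contingency table cells
    rows = Counter(truth)               # class sizes
    cols = Counter(pred)                # cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:           # degenerate partitions on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def purity(truth, pred):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(truth)


# Hypothetical toy data: six documents, two classes, one misplaced document.
truth = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]

ari = adjusted_rand_index(truth, pred)
pur = purity(truth, pred)
print(f"ARI = {ari:.3f}, purity = {pur:.3f}")
```

In a protocol like the one described, such scores would be computed once per corpus, dataspace, and method combination, so that the effect of the data transformation can be separated from that of the clustering method itself.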
Contributor : Martine Cadot
Submitted on : Monday, December 14, 2020 - 1:23:28 PM
Last modification on : Tuesday, December 15, 2020 - 3:56:16 AM


 Restricted access
To satisfy the distribution rights of the publisher, the document is embargoed until : 2021-06-07



  • HAL Id : hal-03053176, version 2


Alain Lelu, Martine Cadot. Importance of Dataspace Embeddings when Evaluating Text Clustering Methods. Data Analysis and Rationality in a Complex World, In press. ⟨hal-03053176v2⟩


