HAL open archive
Preprint, Working Paper. Year: 2024

Beyond words: a comparative analysis of LLM embeddings for effective clustering

Abstract

Document clustering groups similar unlabeled text documents. The task relies on document embedding techniques, which can be derived from various models, including traditional and neural network-based approaches. The emergence of Large Language Models (LLMs) has provided a new way of capturing information from texts through customized numerical representations, potentially enhancing text clustering by identifying subtle semantic connections. The objective of this paper is to demonstrate the impact of LLMs of different sizes on text clustering. To accomplish this, we select five different LLMs and compare them with three less resource-intensive embedding methods. Additionally, we use six clustering algorithms. We jointly assess the performance of the embedding models and clustering algorithms in terms of clustering quality, and highlight the strengths and limitations of the models under investigation.
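The embed, cluster, then score pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and not the authors' setup: a normalised bag-of-words vector stands in for an LLM embedding, plain k-means stands in for the six clustering algorithms, normalised mutual information (NMI) is assumed as the quality measure, and the toy corpus and the helper names `embed`, `kmeans`, and `nmi` are hypothetical.

```python
import math
from collections import Counter


def embed(docs):
    """Stand-in embedding: L2-normalised bag-of-words term counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        )
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center for every vector.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors
        ]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def nmi(truth, pred):
    """Normalised mutual information (arithmetic-mean normalisation)."""
    n = len(truth)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    ct, cp, joint = Counter(truth), Counter(pred), Counter(zip(truth, pred))
    mi = sum(
        (c / n) * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy corpus with two topics and known labels.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten sat",
    "the kitten chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
    "investors sold stocks as markets fell",
]
truth = [0, 0, 0, 1, 1, 1]

pred = kmeans(embed(docs), k=2)
score = nmi(truth, pred)
print(pred, round(score, 3))
```

Swapping `embed` for a function that returns vectors from another model, or `kmeans` for a different algorithm, leaves the scoring step unchanged, which is what makes a grid comparison of embeddings against clustering algorithms straightforward to set up.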
Main file
ida2024_LLM_paper.pdf (1.34 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04488175, version 1 (04-03-2024)

Identifiers

  • HAL Id: hal-04488175, version 1

Cite

Imed Keraghel, Stanislas Morbieu, Mohamed Nadif. Beyond words: a comparative analysis of LLM embeddings for effective clustering. 2024. ⟨hal-04488175⟩