Conference paper, Year: 2012

Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF (Vectorization, Okapi and similarity computation for NLP: finally forgetting TF-IDF)

Vincent Claveau

Abstract

In this position paper, we revisit a problem common to many NLP tasks: computing similarities (or distances) between texts. We aim to show that this step, often treated as a small component of a broader, complex system, is frequently overlooked, which leads to the use of sub-optimal solutions. Indeed, TF-IDF weighting combined with cosine similarity is often presented as "state-of-the-art", while more effective alternatives exist in the Information Retrieval (IR) community. Through experiments on several tasks, we show how this seemingly simple similarity computation can influence system performance. We consider two alternatives in particular. The first is the Okapi-BM25 weighting scheme, well known in IR and directly interchangeable with TF-IDF. The second, called vectorization, is a technique we have developed for computing text similarities, which offers some interesting properties.
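
To make the claim that Okapi-BM25 is "directly interchangeable with TF-IDF" concrete, the sketch below swaps only the term-weighting function while keeping the same cosine similarity. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy corpus, the IDF variants, and the defaults `k1=1.2` and `b=0.75` are all choices made here. The vectorization technique is not described in the abstract, so it is not sketched.

```python
import math
from collections import Counter


def doc_freqs(corpus):
    """Count, for each term, the number of documents that contain it."""
    return Counter(term for doc in corpus for term in set(doc))


def tfidf_vector(doc, corpus):
    """Raw term frequency times log(N / df); one common TF-IDF variant (assumed)."""
    n, df = len(corpus), doc_freqs(corpus)
    tf = Counter(doc)
    return {t: f * math.log(n / df[t]) for t, f in tf.items() if t in df}


def bm25_vector(doc, corpus, k1=1.2, b=0.75):
    """Okapi-BM25 weights: saturating TF with length normalisation.

    k1, b and the smoothed IDF below are conventional choices, assumed here.
    """
    n, df = len(corpus), doc_freqs(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    # Length-dependent part of the denominator of the BM25 TF component.
    norm = k1 * (1.0 - b + b * len(doc) / avg_len)
    tf = Counter(doc)
    return {
        t: (f * (k1 + 1.0) / (f + norm))
        * math.log(1.0 + (n - df[t] + 0.5) / (df[t] + 0.5))
        for t, f in tf.items()
        if t in df
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus of pre-tokenised texts (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

# Only the weighting function changes; the similarity measure stays the same.
for weight in (tfidf_vector, bm25_vector):
    x, y = weight(corpus[0], corpus), weight(corpus[1], corpus)
    print(weight.__name__, round(cosine(x, y), 3))
```

The practical difference lies in the term-frequency component: under this TF-IDF variant a term repeated ten times weighs ten times as much, whereas the BM25 term-frequency factor saturates and can never exceed `k1 + 1`.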
Main file
Claveau_position_TALN2012.pdf (208.36 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00760158, version 1 (04-12-2012)

Identifiers

  • HAL Id: hal-00760158, version 1

Cite

Vincent Claveau. Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF. TALN - Traitement Automatique des Langues Naturelles, Jun 2012, Grenoble, France. ⟨hal-00760158⟩
725 views
1844 downloads
