Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF [Vectorization, Okapi and similarity computation for NLP: finally forgetting TF-IDF]

Vincent Claveau 1
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : In this position paper, we revisit a problem common to many NLP tasks: computing similarities (or distances) between texts. We aim to show that this step, often treated as a small component of a larger, more complex system, is frequently overlooked, which leads to sub-optimal solutions. In particular, TF-IDF weighting combined with cosine similarity is often presented as "state-of-the-art", although more effective alternatives are well established in the Information Retrieval (IR) community. Through experiments on several tasks, we show how this seemingly simple similarity computation can influence system performance. We consider two alternatives. The first is the Okapi-BM25 weighting scheme, well known in IR and directly interchangeable with TF-IDF. The second, called vectorization, is a technique for computing text similarities that we have developed and that offers some interesting properties.
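As an illustration of what "directly interchangeable" can mean in practice, here is a minimal Python sketch in which only the term-weighting function changes while the cosine comparison stays the same. It is not taken from the paper: the function names, the toy corpus, the parameter values `k1=1.2` and `b=0.75`, and the choice of a non-negative BM25 IDF variant (with `+ 1` inside the logarithm) are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]


def bm25_weights(docs, k1=1.2, b=0.75):
    """Weight each term with a common Okapi BM25 formulation.

    k1 and b are conventional default values, assumed here and not
    taken from the paper. The IDF uses a non-negative variant.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    weighted = []
    for d in docs:
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        weighted.append({
            t: math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
               * tf * (k1 + 1) / (tf + norm)
            for t, tf in Counter(d).items()
        })
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical toy corpus of pre-tokenized documents.
docs = [["okapi", "bm25"], ["okapi", "tfidf"], ["okapi", "bm25", "bm25"]]
for weigh in (tfidf_weights, bm25_weights):
    vectors = weigh(docs)
    print(weigh.__name__, cosine(vectors[0], vectors[1]))
```

Swapping `tfidf_weights` for `bm25_weights` leaves the rest of the pipeline untouched, which is the sense in which the two schemes are interchangeable in this sketch. The vectorization technique mentioned in the abstract is not illustrated here, since the abstract does not describe its details.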

https://hal.archives-ouvertes.fr/hal-00760158
Contributor : Vincent Claveau
Submitted on : Tuesday, December 4, 2012 - 10:19:25 AM
Last modification on : Friday, November 16, 2018 - 1:24:27 AM
Long-term archiving on : Tuesday, March 5, 2013 - 3:49:30 AM

File

Claveau_position_TALN2012.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00760158, version 1

Citation

Vincent Claveau. Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF. TALN - Traitement Automatique des Langues Naturelles, Jun 2012, Grenoble, France. ⟨hal-00760158⟩
