Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF [Vectorization, Okapi, and similarity computation for NLP: forgetting TF-IDF at last]

Vincent Claveau (1)
(1) TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract: In this position paper, we review a problem common to many NLP tasks: computing similarities (or distances) between texts. We aim to show that this step, usually treated as a small component of a larger system, is often overlooked, which leads to the use of sub-optimal solutions. Indeed, computing similarity with TF-IDF weighting and the cosine measure is often presented as "state-of-the-art", while more effective alternatives exist in the Information Retrieval (IR) community. Through experiments on several tasks, we show how this simple similarity calculation can influence system performance. We consider two alternatives in particular. The first is the Okapi-BM25 weighting scheme, well known in IR and directly interchangeable with TF-IDF. The other, called vectorization, is a technique for calculating text similarities that we have developed and that offers some interesting properties.
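To make the "directly interchangeable" claim concrete, here is a minimal Python sketch in which only the term-weighting function changes while the cosine comparison stays the same. It is an illustration under stated assumptions, not the configuration used in the paper: the BM25 term uses the commonly cited textbook form with `k1=1.2` and `b=0.75`, both weightings share one smoothed IDF (`log(1 + N/df)`) so that only the term-frequency saturation and length normalization differ, and all function names are hypothetical. The paper's vectorization technique is not described in the abstract and is not sketched here.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed IDF, log(1 + N/df), for every term of a tokenized corpus.

    Assumption: one shared IDF for both weightings, chosen for illustration.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(1 + n / d) for term, d in df.items()}


def tfidf_vector(doc, idf):
    """Raw term frequency multiplied by IDF."""
    tf = Counter(doc)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}


def bm25_vector(doc, idf, avg_len, k1=1.2, b=0.75):
    """BM25-style weight: saturating TF with document-length normalization.

    k1 and b are conventional default values, assumed here.
    """
    tf = Counter(doc)
    norm = k1 * (1 - b + b * len(doc) / avg_len)
    return {t: idf.get(t, 0.0) * f * (k1 + 1) / (f + norm) for t, f in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical data, already tokenized).
docs = [
    ["okapi", "bm25", "ranking"],
    ["tf", "idf", "cosine", "ranking"],
    ["okapi", "okapi", "okapi", "bm25"],
]
idf = idf_table(docs)
avg_len = sum(len(d) for d in docs) / len(docs)

# Same comparison, two interchangeable weightings.
sim_tfidf = cosine(tfidf_vector(docs[0], idf), tfidf_vector(docs[2], idf))
sim_bm25 = cosine(bm25_vector(docs[0], idf, avg_len),
                  bm25_vector(docs[2], idf, avg_len))
```

In this sketch a term repeated three times weighs three times as much under TF-IDF, whereas its BM25 weight stays below `idf * (k1 + 1)` however often it repeats. That saturation, together with the length normalization, is what distinguishes the two schemes here.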
Contributor: Vincent Claveau
Submitted on: Tuesday, December 4, 2012 - 10:19:25 AM
Last modification on: Friday, November 16, 2018 - 1:24:27 AM
Long-term archiving on: Tuesday, March 5, 2013 - 3:49:30 AM




HAL Id: hal-00760158, version 1


Vincent Claveau. Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF. TALN - Traitement Automatique des Langues Naturelles, Jun 2012, Grenoble, France. ⟨hal-00760158⟩


