Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF

Vincent Claveau

Conference paper. Year: 2012

Role: Author
PersonId : 5270
IdHAL : vincent-claveau
ORCID : 0000-0002-3459-0550
IdRef : 075988216

Multimedia content-based indexing

Abstract

In this position paper, we revisit a problem common to many NLP tasks: computing similarities (or distances) between texts. We aim to show that this step, often treated as a small component of a larger and more complex system, is frequently overlooked, which leads to the use of sub-optimal solutions. Indeed, computing similarity with TF-IDF weighting and the cosine measure is often presented as "state-of-the-art", while more effective alternatives exist in the Information Retrieval (IR) community. Through experiments on several tasks, we show how this seemingly simple similarity computation can influence system performance. We consider two alternatives in particular. The first is the Okapi BM25 weighting scheme, well known in IR and directly interchangeable with TF-IDF. The second, called vectorization, is a technique for computing text similarities that we have developed and that offers some interesting properties.
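To make the "directly interchangeable" claim concrete, here is a minimal sketch in Python that swaps a TF-IDF term weight for an Okapi BM25 term weight inside the same cosine comparison. It is an illustration under stated assumptions, not the formulation used in the paper: the function names, the whitespace tokenizer, the smoothed IDF variant, and the default parameters `k1=1.2` and `b=0.75` are all choices made for this example. The vectorization technique mentioned in the abstract is not sketched here, since the abstract does not describe it.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def idf_table(docs_tokens):
    # Smoothed IDF that stays positive: log(1 + (N - df + 0.5) / (df + 0.5)).
    # This is one common variant, chosen here as an assumption.
    n = len(docs_tokens)
    df = Counter(term for doc in docs_tokens for term in set(doc))
    return {t: math.log(1.0 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}


def tfidf_vector(tokens, idf):
    # Raw term frequency times IDF: the weight grows linearly with tf.
    tf = Counter(tokens)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}


def bm25_vector(tokens, idf, avgdl, k1=1.2, b=0.75):
    # BM25 term weight: tf saturates (bounded by k1 + 1) and is
    # normalized by document length relative to the average length.
    tf = Counter(tokens)
    norm = k1 * (1.0 - b + b * len(tokens) / avgdl)
    return {t: idf.get(t, 0.0) * f * (k1 + 1.0) / (f + norm) for t, f in tf.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


# Toy corpus (hypothetical data, only to exercise the functions).
corpus = [
    "the cat sat on the mat",
    "the cat cat cat chased the dog",
    "stock markets fell sharply today",
]
docs = [tokenize(d) for d in corpus]
idf = idf_table(docs)
avgdl = sum(len(d) for d in docs) / len(docs)

tfidf = [tfidf_vector(d, idf) for d in docs]
bm25 = [bm25_vector(d, idf, avgdl) for d in docs]

print("TF-IDF cosine(doc0, doc1):", cosine(tfidf[0], tfidf[1]))
print("BM25   cosine(doc0, doc1):", cosine(bm25[0], bm25[1]))
```

Only the weighting function changes between the two runs; the cosine step is identical. The repeated term in the second toy document contributes linearly under TF-IDF but saturates under BM25, which is the kind of difference the abstract suggests can affect downstream performance.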

Domains

Information Retrieval [cs.IR]; Computation and Language [cs.CL]

Main file

Claveau_position_TALN2012.pdf (208.36 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-00760158

Submitted on: Tuesday, December 4, 2012, 10:19:25

Last modified on: Friday, March 24, 2023, 14:52:56

Long-term archiving on: Tuesday, March 5, 2013, 03:49:30

Dates and versions

hal-00760158 , version 1 (04-12-2012)

Identifiers

HAL Id : hal-00760158 , version 1

Cite

Vincent Claveau. Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF. TALN - Traitement Automatique des Langues Naturelles, Jun 2012, Grenoble, France. ⟨hal-00760158⟩

Collections

EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-D6 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE UR1-MATH-NUM

725 views

1842 downloads
