Corpus-Based methods for Short Text Similarity

Prajol Shrestha

Conference paper. Rencontre des Étudiants Chercheurs en Informatique pour le Traitement automatique des Langues. Year: 2011

Laboratoire d'Informatique de Nantes Atlantique

Abstract

This paper presents corpus-based methods for measuring the similarity between short texts (sentences, paragraphs, ...), a task with many applications in NLP. Previous work on this problem has relied on supervised methods or on external resources such as WordNet and the British National Corpus. Our methods are unsupervised and corpus-based. We present a new method, based on the Vector Space Model, that captures the contextual behavior of terms (their senses and correlations), and show that it performs better than a baseline that uses the vector-based cosine similarity measure. We also evaluate two existing document similarity measures, Dice and Resemblance, which to our knowledge have not previously been used for short text similarity. Finally, we show that the vector-based baseline improves when stems are used instead of words and when its parameters are computed from the candidate sentences rather than from an external resource.
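For orientation, the three existing measures named in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: it assumes whitespace tokenization, raw term-frequency vectors for the cosine baseline, and word sets (rather than shingles) for Dice and Resemblance. The function names are hypothetical, and the paper's own weighting and preprocessing (for example, stemming) may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine between raw term-frequency vectors of two texts (assumed weighting)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice(a, b):
    """Dice coefficient over word sets: 2|A ∩ B| / (|A| + |B|)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def resemblance(a, b):
    """Resemblance (Jaccard) over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


if __name__ == "__main__":
    s1 = "the cat sat on the mat"
    s2 = "the cat lay on the rug"
    print(cosine_similarity(s1, s2), dice(s1, s2), resemblance(s1, s2))
```

All three scores lie in [0, 1]. The cosine baseline takes term frequency into account, while Dice and Resemblance only compare which words occur in each text.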

Keywords

Similarity; Vector Space Model; Similarity metric

Domains

Computation and Language [cs.CL]

Main file

taln11_submission_116.pdf (82.84 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-00609909

Submitted on: Wednesday, 20 July 2011, 14:49:52

Last modified on: Friday, 5 January 2024, 03:24:38

Long-term archiving on: Monday, 12 November 2012, 11:20:48

Dates and versions

hal-00609909 , version 1 (20-07-2011)

Identifiers

HAL Id: hal-00609909, version 1

Cite

Prajol Shrestha. Corpus-Based methods for Short Text Similarity. Rencontre des Étudiants Chercheurs en Informatique pour le Traitement automatique des Langues, Jun 2011, Montpellier, France. pp.297. ⟨hal-00609909⟩

Collections

UNIV-NANTES CNRS LINA LINA-TALN LS2N NANTES-UNIVERSITE

369 views

2614 downloads
