UFSAC: Unification of Sense Annotated Corpora and Tools

Abstract: In Word Sense Disambiguation, sense-annotated corpora are often essential for evaluating a system and valuable for achieving good performance. About a dozen sense-annotated English corpora exist today, each created for a specific purpose, in various formats and using different versions of WordNet. The main hypothesis of this work is that it should be possible to build a disambiguation system using any of these corpora during either the training phase or the testing phase, regardless of their original purpose. In this article, we present UFSAC, a corpus format that can be used for either training or testing a disambiguation system, and the process we followed to construct it. We provide the community with the whole set of sense-annotated English corpora that we know of, in this unified format, when copyright allows it, with sense keys converted to the latest version of WordNet. We also provide the source code for building these corpora from their original data, and a complete Java API for manipulating corpora in this format. The whole resource is available at the following URL: https://github.com/getalp/UFSAC.
Document type:
Conference paper
Language Resources and Evaluation Conference (LREC), May 2018, Miyazaki, Japan

https://hal.archives-ouvertes.fr/hal-01718237
Contributor: Benjamin Lecouteux
Submitted on: Tuesday, 27 February 2018 - 11:19:12
Last modified on: Thursday, 11 October 2018 - 08:48:03
Document(s) archived on: Monday, 28 May 2018 - 17:43:37

File

LREC_2018(24).pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01718237, version 1

Citation

Loïc Vial, Benjamin Lecouteux, Didier Schwab. UFSAC: Unification of Sense Annotated Corpora and Tools. Language Resources and Evaluation Conference (LREC), May 2018, Miyazaki, Japan. 〈hal-01718237〉

Metrics

Record views: 146

File downloads: 140