Conference paper, Year: 2016

Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora

Abstract

Comparable corpora are the main alternative to parallel corpora for extracting bilingual lexicons. Although comparable corpora are easier to build, specialized comparable corpora are often modest in size compared with general-domain corpora. Consequently, the word co-occurrence observations on which context-based methods rely are unreliable. In this article, we propose to improve the word co-occurrences of specialized comparable corpora, and thus their context representations, by using general-domain data. This idea, which has been used in machine translation for more than a decade, is not straightforward to apply to bilingual lexicon extraction from specific-domain comparable corpora. We go against the mainstream for this task, where many studies support the idea that adding out-of-domain documents decreases the quality of lexicons. Our empirical evaluation shows the advantages of this approach, which yields a significant gain in the accuracy of the extracted lexicons.
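
To make the idea of enriching context representations concrete, here is a minimal sketch of how co-occurrence counts from a small specialized corpus could be supplemented with counts from general-domain text. It is an illustration only, not the authors' method: the abstract does not describe their data selection procedure, and the function names (`cooccurrence_counts`, `enrich`), the window size, the toy sentences, and the plain summing of counts are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=3):
    """Count, for each word, the words seen within `window` tokens of it."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts


def enrich(specialized, general, vocabulary):
    """Add general-domain counts to the context vectors of the given words.

    Assumption: counts are simply summed; a real system might weight or
    filter the general-domain data instead.
    """
    enriched = defaultdict(Counter)
    for word in vocabulary:
        enriched[word].update(specialized.get(word, Counter()))
        enriched[word].update(general.get(word, Counter()))
    return enriched


# Hypothetical toy corpora, already tokenized.
specialized_sents = [["breast", "cancer", "screening"], ["cancer", "treatment"]]
general_sents = [["cancer", "research", "funding"], ["weather", "report"]]

spec = cooccurrence_counts(specialized_sents)
gen = cooccurrence_counts(general_sents)

# Only words from the specialized corpus get a context vector.
enriched = enrich(spec, gen, vocabulary=set(spec))
```

In this sketch, a word such as "cancer" keeps its specialized contexts and gains extra ones observed in the general-domain sentences, while words that occur only in the general-domain text are left out of the vocabulary.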
Main file
C16-1321.pdf (135.48 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02001789, version 1 (09-01-2020)

Identifiers

  • HAL Id: hal-02001789, version 1

Cite

Amir Hazem, Emmanuel Morin. Efficient Data Selection for Bilingual Terminology Extraction from Comparable Corpora. 26th International Conference on Computational Linguistics (COLING), Dec 2016, Osaka, Japan. pp.3401-3411. ⟨hal-02001789⟩
