Conference paper. Year: 2018

Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications

Abstract

The European "Tenders Electronic Daily" (TED) is a large source of semi-structured, multilingual data that is very valuable to the Natural Language Processing (NLP) community. This data set can be used to address complex machine translation, multilingual terminology extraction, and text-mining tasks, or to benchmark information retrieval systems. Although a user-friendly web site gives the public access to the published EU calls for tenders, collecting and managing this kind of data is a great burden and consumes a lot of time and computing resources. This could explain why the resource is little exploited, if at all, by computer scientists and engineers in NLP today. The aim of this paper is to describe two documented and easy-to-use multilingual corpora (one of which is a parallel corpus) extracted from the TED web source, which we will release for the benefit of the NLP community.
Main file
LREC-832.pdf (76.85 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01865091, version 1 (30-08-2018)

Identifiers

  • HAL Id: hal-01865091, version 1

Cite

Oussama Ahmia, Nicolas Béchet, Pierre-François Marteau. Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan. ⟨hal-01865091⟩
