Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications

Oussama Ahmia; Nicolas Béchet; Pierre-François Marteau

Conference paper, Year: 2018

Oussama Ahmia

Role: Author

Expressiveness in Human Centered Data/Media

Nicolas Béchet

Role: Author
PersonId : 181774
IdHAL : nicolas-bechet
ORCID : 0000-0001-9425-5570
IdRef : 142928879

Expressiveness in Human Centered Data/Media

Pierre-François Marteau

Role: Author
PersonId : 219
IdHAL : pierre-francois-marteau
ORCID : 0000-0002-3963-8795
IdRef : 033981124

Expressiveness in Human Centered Data/Media

Abstract

The European "Tenders Electronic Daily" (TED) is a large source of semi-structured, multilingual data that is highly valuable to the Natural Language Processing (NLP) community. These data sets can be used to address complex machine translation, multilingual terminology extraction, and text-mining tasks, or to benchmark information retrieval systems. Despite the user-friendly web site made available to the public for accessing published EU calls for tenders, collecting and managing this kind of data is a heavy burden that consumes considerable time and computing resources. This could explain why the resource is little exploited, if at all, by computer scientists and engineers in NLP today. The aim of this paper is to describe two documented and easy-to-use multilingual corpora (one of which is a parallel corpus), extracted from the TED web source, that we will release for the benefit of the NLP community.

Keywords

Multilingual corpora; Parallel Corpus; Call for Tender; European Languages; Natural Language Resource

Domains

Information Retrieval [cs.IR]; Document and Text Processing

Main file

LREC-832.pdf (76.85 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-01865091

Submitted on: Thursday, 30 August 2018, 18:47:37

Last modified on: Friday, 24 March 2023, 14:53:08

Dates and versions

hal-01865091 , version 1 (30-08-2018)

Identifiers

HAL Id : hal-01865091 , version 1

Cite

Oussama Ahmia, Nicolas Béchet, Pierre-François Marteau. Two Multilingual Corpora Extracted from the Tenders Electronic Daily for Machine Learning and Machine Translation Applications. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan. ⟨hal-01865091⟩

Collections

INSTITUT-TELECOM UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA CENTRALESUPELEC IRISA-D6 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

235 views

181 downloads
