Conference paper. Year: 2020

MEDLINE as a parallel corpus: a survey to gain insight on French-, Spanish- and Portuguese-speaking authors' abstract writing practice

Aurélie Névéol
  • Role: Author
Antonio Jimeno Yepes
  • Role: Author
  • PersonId: 1034914
Mariana L Neves
  • Role: Author

Abstract

Background: Parallel corpora are used to train and evaluate machine translation systems. To alleviate the cost of producing parallel resources for evaluation campaigns, existing corpora are leveraged. However, little information may be available about the methods used for producing the corpus, including translation direction. Objective: To gain insight on MEDLINE parallel corpus used in the biomedical task at the Workshop on Machine Translation in 2019 (WMT 2019). Material and Methods: Contact information for the authors of MEDLINE articles included in the English/Spanish (EN/ES), English/French (EN/FR), and English/Portuguese (EN/PT) WMT 2019 test sets was obtained from PubMed and publisher websites. The authors were asked about their abstract writing practices in a survey. Results: The response rate was above 20%. Authors reported that they are mainly native speakers of languages other than English. Although manual translation, sometimes via professional translation services, was commonly used for abstract translation, authors of articles in the EN/ES and EN/PT sets also relied on post-edited machine translation. Discussion: This study provides a characterization of MEDLINE authors’ language skills and abstract writing practices. Conclusion: The information collected in this study will be used to inform test set design for the next WMT biomedical task.
Main file
MEDLINE_AuthorSurvey_LREC_final5iii2020.pdf (161.89 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03023950, version 1 (24-12-2020)

Identifiers

  • HAL Id: hal-03023950, version 1

Cite

Aurélie Névéol, Antonio Jimeno Yepes, Mariana L Neves. MEDLINE as a parallel corpus: a survey to gain insight on French-, Spanish- and Portuguese-speaking authors' abstract writing practice. International Conference on Language Resources and Evaluation, ELRA, May 2020, Marseille, France. ⟨hal-03023950⟩
