Neural machine translation, corpus and frugality - Archive ouverte HAL
Preprint, Working Paper. Year: 2021

Neural machine translation, corpus and frugality

Abstract

In the field of machine translation, in both academia and industry, there is growing interest in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on the observation of a standard professional human translator, we estimate that such corpora should consist of at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.

Domains

Linguistics

Dates and versions

hal-03123565 , version 1 (28-01-2021)

Identifiers

Cite

Raoul Blin. Neural machine translation, corpus and frugality. 2021. ⟨hal-03123565⟩
