Neural machine translation, corpus and frugality - Archive ouverte HAL
Preprint, Working Paper. Year: 2021

Neural machine translation, corpus and frugality

Abstract

In the field of machine translation, in both academia and industry, there is growing interest in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on the observation of a standard professional human translator, we estimate that such corpora should consist of at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.

Domains

Linguistics

Dates and versions

hal-03123565 , version 1 (28-01-2021)

Identifiers

Cite

Raoul Blin. Neural machine translation, corpus and frugality. 2021. ⟨hal-03123565⟩
