Towards Accurate Predictors of Word Quality for Machine Translation: Lessons Learned on French - English and English - Spanish Systems

This paper proposes ways to build effective estimators that predict the quality of individual words in Machine Translation (MT) output. We introduce a number of novel features of various types (system-based, lexical, syntactic and semantic) and integrate them with the conventional, previously used feature set to train our baseline classifiers. The classifiers are built on two different bilingual corpora: French–English (fr–en) and English–Spanish (en–es). After experimenting with all features, we apply a "Feature Selection" strategy to retain the best-performing ones. We then combine multiple "weak" classifiers into a strong "composite" classifier that exploits their complementarity, which yields a significant improvement in F-score for both the fr–en and en–es systems. Finally, we use word confidence scores to improve sentence-level quality estimation.

Keywords

quality estimation; word confidence estimation; statistical machine translation

Domains

Computation and Language [cs.CL]

Main file

DKE_final.pdf (456.97 KB)

Origin: Files produced by the author(s)

Contributor: Laurent Besacier

https://hal.science/hal-01147902

Submitted on: Monday, May 4, 2015, 14:23:14

Last modified on: Monday, April 15, 2024, 11:25:23

Long-term archiving on: Wednesday, April 19, 2017, 12:34:34

Dates and versions

hal-01147902, version 1 (04-05-2015)

Identifiers

HAL Id: hal-01147902, version 1
DOI: 10.1016/j.datak.2015.04.003

Cite

Ngoc-Quang Luong, Laurent Besacier, Benjamin Lecouteux. Towards Accurate Predictors of Word Quality for Machine Translation: Lessons Learned on French - English and English - Spanish Systems. Data and Knowledge Engineering, 2015, pp.11. ⟨10.1016/j.datak.2015.04.003⟩. ⟨hal-01147902⟩


Collections

UGA CNRS LIG LIG_TDCGE_GETALP POLYTECH-GRENOBLE LIG_SIDCH
