An Open Source Toolkit for Word-level Confidence Estimation in Machine Translation

Abstract: A growing need for Confidence Estimation (CE) for Statistical Machine Translation (SMT) systems in Computer-Aided Translation (CAT) has recently been observed. However, most CE toolkits are optimized for a single target language (mainly English) and, as far as we know, none of them is both dedicated to this specific task and freely available. This paper presents an open-source toolkit for predicting the quality of the words of an SMT output, whose novel contributions are (i) support for various target languages and (ii) handling of a number of features of different types (system-based, lexical, syntactic and semantic). In addition, the toolkit integrates a wide variety of Natural Language Processing or Machine Learning tools to pre-process data, extract features and estimate confidence at the word level. Features for Word-level Confidence Estimation (WCE) can easily be added or removed using a configuration file. We validate the toolkit by experimenting in the WCE evaluation framework of the WMT shared task with two language pairs: French-English and English-Spanish. The toolkit is made available to the research community with ready-made scripts to launch full experiments on these language pairs, while achieving state-of-the-art and reproducible performance.
Document type:
Conference paper
The 12th International Workshop on Spoken Language Translation (IWSLT'15), Dec 2015, Da Nang, Vietnam. 2015, 〈http://workshop2015.iwslt.org/〉
Cited literature [33 references]

https://hal.archives-ouvertes.fr/hal-01244477
Contributor: Christophe Servan
Submitted on: Tuesday, 15 December 2015 - 18:46:50
Last modified on: Thursday, 11 October 2018 - 08:48:03
Document(s) archived on: Wednesday, 16 March 2016 - 16:20:58

File

WCE_iwslt15.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id: hal-01244477, version 1

Citation

Christophe Servan, Ngoc-Tien Le, Ngoc Quang Luong, Benjamin Lecouteux, Laurent Besacier. An Open Source Toolkit for Word-level Confidence Estimation in Machine Translation. The 12th International Workshop on Spoken Language Translation (IWSLT'15), Dec 2015, Da Nang, Vietnam. 2015, 〈http://workshop2015.iwslt.org/〉. 〈hal-01244477〉
