Data Selection for Compact Adapted SMT Models

Shachar Mirkin; Laurent Besacier

Conference paper, Year: 2014

Shachar Mirkin

Role: Author
PersonId: 966736

Xerox Research Centre Europe [Meylan]

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laurent Besacier

Role: Author
PersonId: 1521
IdHAL: laurent-besacier
ORCID: 0000-0001-7411-9125
IdRef: 079377017

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Institut universitaire de France

Abstract

Data selection is a common technique for adapting statistical translation models to a specific domain, and it has been shown both to improve translation quality and to reduce model size. Selection relies on some in-domain data, drawn from the same domain as the texts expected to be translated. Selecting the sentence pairs from a pool of parallel texts that are most similar to the in-domain data has been shown to be effective; yet this approach risks limited coverage when necessary n-grams that do appear in the pool are found in sentence pairs less similar to the in-domain data available in advance. Some methods select additional data based on the actual text that needs to be translated. While useful, this is not always a practical scenario. In this work we describe an extensive exploration of data selection techniques over Arabic-to-French datasets, and propose methods that address both similarity and coverage considerations while maintaining a limited model size.
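The tension between similarity and coverage described in the abstract can be illustrated with a small sketch. This is not the method proposed in the paper: the overlap-based similarity score, the two-phase greedy procedure, and the names `ngrams`, `select_pairs`, `budget` and `n_similar` are all assumptions made for illustration. The sketch first takes the source sentences most similar to the in-domain data, then spends the rest of a fixed budget on pairs that add in-domain n-grams not yet covered.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def ngrams(tokens: List[str], max_n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of all n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_pairs(pool: List[Pair], in_domain: List[str],
                 budget: int, n_similar: int, max_n: int = 2) -> List[int]:
    """Pick up to `budget` pool indices (hypothetical two-phase scheme).

    Phase 1 takes the `n_similar` pairs whose source side overlaps most
    with the in-domain n-grams. Phase 2 greedily adds the pair that
    covers the most in-domain n-grams still missing from the selection.
    """
    target: Set[Tuple[str, ...]] = set()
    for sentence in in_domain:
        target |= ngrams(sentence.split(), max_n)
    grams = [ngrams(src.split(), max_n) for src, _ in pool]

    def similarity(i: int) -> float:
        # Assumed score: share of the sentence's n-grams seen in-domain.
        return len(grams[i] & target) / len(grams[i]) if grams[i] else 0.0

    ranked = sorted(range(len(pool)), key=similarity, reverse=True)
    selected = ranked[:min(n_similar, budget)]
    covered: Set[Tuple[str, ...]] = set()
    for i in selected:
        covered |= grams[i] & target

    remaining = ranked[len(selected):]
    while len(selected) < budget and remaining:
        best = max(remaining, key=lambda i: len((grams[i] & target) - covered))
        gain = (grams[best] & target) - covered
        if not gain:  # nothing left in the pool improves coverage
            break
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected


in_domain = ["the patient received treatment", "the doctor prescribed a drug"]
pool = [
    ("the patient received a drug", "le patient a reçu un médicament"),
    ("the patient received", "le patient a reçu"),
    ("the doctor is here and prescribed a long rest",
     "le médecin est ici et a prescrit un long repos"),
    ("markets rose", "les marchés ont monté"),
]
print(select_pairs(pool, in_domain, budget=2, n_similar=1))  # [1, 2]
```

With a budget of two pairs, ranking by this similarity score alone would keep pairs 1 and 0, which overlap heavily with each other. The sketch instead keeps pair 2 as its second choice, because that less similar sentence carries more in-domain n-grams that are still uncovered.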

Domains

Computation and Language [cs.CL]

Main file

2014-046.pdf (220.16 KB)

Origin: Files produced by the author(s)

Contributor: Laurent Besacier

https://hal.science/hal-01350121

Submitted on: Friday, July 29, 2016, 16:40:12

Last modified on: Monday, April 15, 2024, 11:25:23

Dates and versions

hal-01350121, version 1 (29-07-2016)

Identifiers

HAL Id: hal-01350121, version 1

Cite

Shachar Mirkin, Laurent Besacier. Data Selection for Compact Adapted SMT Models. Eleventh Conference of the Association for Machine Translation in the Americas (AMTA), Oct 2014, Vancouver, Canada. ⟨hal-01350121⟩

Collections

UGA CNRS LIG LIG_TDCGE_GETALP POLYTECH-GRENOBLE LIG_SIDCH
