Traitement des Mots Hors Vocabulaire pour la Traduction Automatique de Document OCRisés en Arabe (Out-of-Vocabulary Word Processing for the Machine Translation of OCRed Arabic Documents)

Abstract : This article presents a new system that automatically translates images of Arabic documents. It involves two modules: an Arabic optical character recognition (OCR) module and an Arabic-French machine translation (MT) module. OCR-MT coupling has received little attention in the literature. The originality of this work lies in proposing a close coupling between OCR and MT, together with specific processing of out-of-vocabulary (OOV) words caused by OCR errors. The OCR-MT coupling, based on a hypothesis lattice, and our OOV processing by replacement (according to a composite measure that takes into account the surface form and the context of the word) yield a significant improvement in translation performance. Our experiments are carried out on a challenging corpus of digitized Arabic newspapers, and we obtain BLEU improvements of 3.73 and 5.5 on our development and test corpora respectively.
Document type :
Conference papers
Cited literature [36 references]

https://hal.archives-ouvertes.fr/hal-01623072
Contributor : Benjamin Lecouteux
Submitted on : Thursday, November 9, 2017 - 1:40:59 PM
Last modification on : Tuesday, April 2, 2019 - 1:47:43 AM
Document(s) archived on : Saturday, February 10, 2018 - 3:06:17 PM

File

papier_kamel.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-01623072, version 1

Citation

Kamel Bouzidi, Zied Elloumi, Laurent Besacier, Benjamin Lecouteux, Mohamed Faouzi Benzeghiba. Traitement des Mots Hors Vocabulaire pour la Traduction Automatique de Document OCRisés en Arabe. TALN 2017, Jun 2017, Orléans, France. ⟨hal-01623072⟩
