Conference papers

Traitement des Mots Hors Vocabulaire pour la Traduction Automatique de Document OCRisés en Arabe (Out-of-Vocabulary Word Processing for Machine Translation of OCRed Arabic Documents)

Abstract: This article presents a new system that automatically translates images of Arabic documents. It involves two modules: an Arabic optical character recognition (OCR) module and an Arabic-French machine translation (MT) module. OCR-MT coupling has received little attention in the literature; the originality of this work lies in proposing a close coupling between OCR and MT, together with specific processing of out-of-vocabulary (OOV) words caused by OCR errors. The OCR-MT coupling, based on a hypothesis lattice, and our OOV processing by replacement (according to a composite measure that takes into account the surface form and the context of the word) yield a significant improvement in translation performance. Our experiments are carried out on a challenging corpus of digitized Arabic newspapers, and we obtain BLEU improvements of 3.73 and 5.5 points on our development and test corpora, respectively.
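The abstract does not spell out the composite measure, so the following is only a minimal sketch of the general idea of replacing an OOV token with an in-vocabulary candidate scored on both surface form and context. Everything in it is an assumption made for illustration and not the authors' method: the function names, the `difflib` similarity ratio as the surface measure, the add-one smoothed bigram probability as the context measure, the weight `alpha`, and the toy English data (used in place of Arabic for readability).

```python
from difflib import SequenceMatcher


def surface_similarity(a, b):
    """Character-level similarity in [0, 1] (difflib ratio; an illustrative choice)."""
    return SequenceMatcher(None, a, b).ratio()


def context_score(prev_word, candidate, bigram_counts, unigram_counts):
    """Add-one smoothed estimate of P(candidate | prev_word) from toy counts."""
    vocab_size = len(unigram_counts)
    pair_count = bigram_counts.get((prev_word, candidate), 0)
    return (pair_count + 1) / (unigram_counts.get(prev_word, 0) + vocab_size)


def replace_oov(tokens, vocab, bigram_counts, unigram_counts, alpha=0.5):
    """Replace each token missing from vocab by the best-scoring vocabulary word.

    The score is a hypothetical weighted sum:
    alpha * surface similarity + (1 - alpha) * context score.
    """
    candidates = sorted(vocab)  # fixed order keeps tie-breaking deterministic
    output = []
    for token in tokens:
        if token in vocab:
            output.append(token)
            continue
        # Use the previous (already corrected) word as left context.
        prev_word = output[-1] if output else "<s>"
        best = max(
            candidates,
            key=lambda c: alpha * surface_similarity(token, c)
            + (1 - alpha) * context_score(prev_word, c, bigram_counts, unigram_counts),
        )
        output.append(best)
    return output


# Toy data (hypothetical): "goverment" stands in for an OCR-corrupted word.
toy_unigrams = {"the": 10, "government": 5, "said": 4, "garment": 1}
toy_bigrams = {("the", "government"): 4, ("the", "garment"): 1}
toy_vocab = set(toy_unigrams)

corrected = replace_oov(["the", "goverment", "said"], toy_vocab, toy_bigrams, toy_unigrams)
print(corrected)
```

In this sketch the replacement happens on a single token sequence; the system described in the abstract works on a hypothesis lattice, which this example does not attempt to model.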
Document type: Conference papers

Cited literature: 36 references

Contributor: Benjamin Lecouteux
Submitted on: Thursday, November 9, 2017 - 1:40:59 PM
Last modification on: Wednesday, November 3, 2021 - 6:47:14 AM
Long-term archiving on: Saturday, February 10, 2018 - 3:06:17 PM


Publisher files allowed on an open archive


  • HAL Id : hal-01623072, version 1


Kamel Bouzidi, Zied Elloumi, Laurent Besacier, Benjamin Lecouteux, Mohamed Faouzi Benzeghiba. Traitement des Mots Hors Vocabulaire pour la Traduction Automatique de Document OCRisés en Arabe. TALN 2017, Jun 2017, Orléans, France. ⟨hal-01623072⟩


