Low-resource OCR error detection and correction in French Clinical Texts

Eva d'Hondt; Cyril Grouin; Brigitte Grau

Conference paper, Year: 2016


Eva d'Hondt

Role: Author
PersonId: 1021926

Cyril Grouin

Role: Author
PersonId: 177247
IdHAL: cyril-grouin
ORCID: 0000-0001-5809-188X
IdRef: 163639132

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Brigitte Grau

Role: Author
PersonId: 177137
IdHAL: brigitte-grau

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Abstract

In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. Traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training material, but these resources are not always available given the constraints placed on using medical corpora. We therefore propose a novel method that needs only a representative corpus of acceptable OCR quality in order to train models. Our method uses recurrent neural networks (RNNs) to model sequential information at the character level for a given medical text corpus. By inserting noise during the training process we can simultaneously learn the underlying (character-level) language model and learn to detect and eliminate random noise from the textual input. The resulting models are robust to the variability of OCR quality and do not require additional, external information such as lexicons. We compare two different ways of injecting noise into the training process and evaluate our models on a manually corrected data set. We find that the best-performing system achieves 73% accuracy.
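The idea of training on artificially corrupted text can be illustrated with a minimal sketch. The abstract does not say which two noise-injection schemes the authors compare, so the scheme below (random substitution, deletion, or insertion of single characters) is an assumption chosen only for illustration. The names `inject_noise`, `make_training_pairs`, `ALPHABET`, and the `rate` parameter are hypothetical and do not come from the paper.

```python
import random

# Hypothetical character inventory used to draw random noise characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü '-"


def inject_noise(text, rate, rng):
    """Return a corrupted copy of `text`.

    Each character is, with probability `rate`, either replaced by a
    random character, dropped, or followed by an extra random character.
    This is an illustrative scheme, not the one used by the authors.
    """
    out = []
    for ch in text:
        if rng.random() < rate:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(ALPHABET))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(ALPHABET))
            # "delete": append nothing, so the character is dropped
        else:
            out.append(ch)
    return "".join(out)


def make_training_pairs(lines, rate=0.1, seed=0):
    """Build (noisy input, clean target) pairs for a character-level model."""
    rng = random.Random(seed)
    return [(inject_noise(line, rate, rng), line) for line in lines]


if __name__ == "__main__":
    corpus = ["examen du placenta", "absence de malformation"]
    for noisy, clean in make_training_pairs(corpus, rate=0.15, seed=42):
        print(repr(noisy), "->", repr(clean))
```

A sequence model trained on such pairs sees corrupted text as input and the clean text as target, which is one way to learn a language model and a denoiser from the same corpus without an external lexicon.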

Domains

Computer Science [cs]; Computation and Language [cs.CL]

Main file

emnlp-louhi2016.pdf (153.35 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-01831225

Submitted on: Wednesday, September 11, 2019, 11:00:19

Last modified on: Saturday, October 7, 2023, 21:36:20

Long-term archiving on: Saturday, February 8, 2020, 00:20:57

Dates and versions

hal-01831225 , version 1 (11-09-2019)

Identifiers

HAL Id: hal-01831225, version 1

Cite

Eva d'Hondt, Cyril Grouin, Brigitte Grau. Low-resource OCR error detection and correction in French Clinical Texts. International Workshop on Health Text Mining and Information Analysis, ACL, Nov 2016, Austin, United States. ⟨hal-01831225⟩


Collections

CNRS LIMSI UNIV-PARIS-SACLAY SORBONNE-UNIVERSITE LISN GS-ENGINEERING GS-COMPUTER-SCIENCE

32 views

66 downloads
