Conference paper, Year: 2016

Low-resource OCR error detection and correction in French Clinical Texts

Abstract

In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. Traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training material, but these are not always available given the constraints placed on using medical corpora. We therefore propose a novel method that needs only a representative corpus of acceptable OCR quality in order to train models. Our method uses recurrent neural networks (RNNs) to model sequential information at the character level for a given medical text corpus. By inserting noise during the training process we can simultaneously learn the underlying (character-level) language model and learn to detect and eliminate random noise from the textual input. The resulting models are robust to the variability of OCR quality but do not require additional, external information such as lexicons. We compare two different ways of injecting noise into the training process and evaluate our models on a manually corrected data set. We find that the best performing system achieves 73% accuracy.
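The noise-injection idea in the abstract can be illustrated with a small sketch. The abstract does not specify either of the two injection schemes the authors compare, so the following is only one plausible variant, assumed for illustration: each character of a clean line is replaced by a random different character with a fixed probability, giving aligned (noisy, clean) pairs that a character-level model could be trained on. The function names, the `rate` parameter and the sample sentences are hypothetical and do not come from the paper or its corpus.

```python
import random


def inject_noise(text, rate, alphabet, rng):
    """Return a copy of `text` where each character is replaced, with
    probability `rate`, by a different character drawn from `alphabet`.

    Substitution-only noise is an assumption made for this sketch; it keeps
    the noisy and clean strings the same length, so they stay aligned
    character by character.
    """
    noisy = []
    for ch in text:
        candidates = [c for c in alphabet if c != ch]
        if candidates and rng.random() < rate:
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(ch)
    return "".join(noisy)


def make_training_pairs(lines, rate=0.1, seed=0):
    """Build (noisy, clean) pairs from clean lines.

    The alphabet is taken from the corpus itself, so no external lexicon
    or character inventory is needed.
    """
    rng = random.Random(seed)
    alphabet = sorted(set("".join(lines)))
    return [(inject_noise(line, rate, alphabet, rng), line) for line in lines]


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the clinical corpus.
    corpus = ["examen du placenta", "poids du foetus"]
    for noisy, clean in make_training_pairs(corpus, rate=0.2, seed=1):
        print(repr(noisy), "->", repr(clean))
```

In such a setup the noisy string would serve as the model input and the clean string as the target. Insertions and deletions, which real OCR output also contains, would break the one-to-one alignment and are left out of this sketch.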
Main file
emnlp-louhi2016.pdf (153.35 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01831225, version 1 (11-09-2019)

Identifiers

  • HAL Id: hal-01831225, version 1

Cite

Eva d'Hondt, Cyril Grouin, Brigitte Grau. Low-resource OCR error detection and correction in French Clinical Texts. International Workshop on Health Text Mining and Information Analysis, ACL, Nov 2016, Austin, United States. ⟨hal-01831225⟩