Bootstrapped OCR error detection for a less-resourced language variant

Adrien Barbaresi

Conference paper. Year: 2016


Role: Author
PersonId: 1134
IdHAL: adrien-barbaresi
ORCID: 0000-0002-8079-8694

Affiliations:
1 Berlin-Brandenburgische Akademie der Wissenschaften
2 Austrian Academy of Sciences

Abstract

This study focuses on isolated error detection in a retro-digitized newspaper corpus published from 1946 to 1990 in the former German Democratic Republic. As there are OCR errors throughout the corpus but no clean reference for this variant of German, automatic OCR correction requires overcoming data sparseness and non-standard spelling, including compounds and inflected forms. The contributions of this paper are (1) a method to bootstrap the detection of potential misspellings, (2) an assessment of several types of training data, and (3) an evaluation of several off-the-shelf candidate selection techniques. The chosen solution, based on statistical affix analysis, reaches an accuracy 10 points higher than existing morphological analysis systems on error detection, while a combination of fuzzy and approximate string search performs best for error correction. The criteria are met, since it is possible to correct erroneous tokens without introducing too much noise.
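The abstract does not describe the implementation, so the following is only a minimal sketch of the general detect-then-correct idea, not the paper's method. It assumes a tiny hypothetical lexicon (`LEXICON`), uses a plain lexicon lookup as a stand-in for the statistical affix analysis, and uses Python's standard `difflib.get_close_matches` as a stand-in for the fuzzy and approximate string search. The function names and the `cutoff` value are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical mini-lexicon of known-good word forms (assumption for illustration).
LEXICON = ["Zeitung", "Republik", "Arbeiter", "Regierung", "Wirtschaft"]


def is_suspicious(token, lexicon):
    """Flag a token as a potential OCR error if it is absent from the lexicon.

    This simple lookup stands in for the affix-based detection named in the
    abstract; it is not that technique.
    """
    return token not in lexicon


def suggest(token, lexicon, cutoff=0.8):
    """Return the most similar lexicon entry (difflib ratio >= cutoff), or None."""
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def correct(tokens, lexicon=LEXICON):
    """Replace suspicious tokens only when a sufficiently close candidate exists.

    Tokens without a close candidate are left unchanged, which limits the
    noise introduced by over-correction.
    """
    corrected = []
    for token in tokens:
        if is_suspicious(token, lexicon):
            corrected.append(suggest(token, lexicon) or token)
        else:
            corrected.append(token)
    return corrected
```

With this toy lexicon, `correct(["Zeitnng", "Republik", "xyz"])` maps the misrecognized `Zeitnng` to `Zeitung` and leaves the other two tokens untouched. A high `cutoff` trades recall for precision, in line with the stated goal of correcting tokens without introducing too much noise.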

Keywords

OCR error correction; Affix trees; Cultural Heritage; Morphological Analysis

Domains

Linguistics; Computation and Language [cs.CL]; Cultural heritage and museology

Main file

Barbaresi_OCR-error-detection_2016.pdf (234.77 KB)

Origin: Publisher files allowed on an open archive


https://hal.science/hal-01371689

Submitted on: Monday, 26 September 2016, 13:09:32

Last modified on: Wednesday, 12 December 2018, 13:32:04

Long-term archiving on: Tuesday, 27 December 2016, 13:07:21

Dates and versions

hal-01371689, version 1 (26-09-2016)

Identifiers

HAL Id: hal-01371689, version 1

Cite

Adrien Barbaresi. Bootstrapped OCR error detection for a less-resourced language variant. 13th Conference on Natural Language Processing (KONVENS 2016), Sep 2016, Bochum, Germany. pp.21-26. ⟨hal-01371689⟩


