Conference papers

Bootstrapped OCR error detection for a less-resourced language variant

Abstract : This study focuses on isolated error detection in a retro-digitized newspaper corpus published from 1946 to 1990 in the former German Democratic Republic. Because OCR errors occur throughout the corpus and no clean reference exists for this variant of German, automatic OCR correction has to overcome data sparseness and non-standard spelling, including compounds and inflected forms. The contributions of this paper are (1) a method to bootstrap the detection of potential misspellings, (2) an assessment of several types of training data, and (3) an evaluation of several off-the-shelf candidate selection techniques. The chosen solution, based on statistical affix analysis, reaches an accuracy 10 points higher than existing morphological analysis systems on error detection, while a combination of fuzzy and approximate string search performs best for error correction. The criteria are met, since erroneous tokens can be corrected without introducing too much noise.
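The detect-then-correct pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: a plain lexicon lookup stands in for the statistical affix analysis, and Python's standard `difflib.get_close_matches` stands in for the fuzzy and approximate string search. The function name `detect_and_correct`, the toy lexicon, and the `cutoff` value are all assumptions made for illustration.

```python
from difflib import get_close_matches


def detect_and_correct(tokens, lexicon, cutoff=0.8):
    """Flag tokens missing from the lexicon and suggest the closest entry.

    Hypothetical helper: the lexicon lookup is a stand-in for error
    detection, and difflib similarity is a stand-in for candidate
    selection. Returns (token, is_suspect, suggestion) tuples.
    """
    vocabulary = sorted(lexicon)  # get_close_matches expects a sequence
    results = []
    for token in tokens:
        if token in lexicon:
            # Known word: not flagged, nothing to suggest.
            results.append((token, False, None))
            continue
        # Unknown word: look for the single most similar lexicon entry.
        candidates = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        suggestion = candidates[0] if candidates else None
        results.append((token, True, suggestion))
    return results


# Toy lexicon and tokens, invented for this example.
lexicon = {"Zeitung", "Republik", "Arbeiter"}
tokens = ["Zeitung", "Zeitnng", "Repnblik", "xyzq"]
results = detect_and_correct(tokens, lexicon)
```

A fairly high `cutoff` keeps the sketch conservative: a flagged token with no sufficiently similar lexicon entry is left uncorrected rather than replaced, which matches the stated goal of correcting tokens without introducing too much noise.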

Cited literature : 26 references
Contributor : Adrien Barbaresi
Submitted on : Monday, September 26, 2016 - 1:09:32 PM
Last modification on : Wednesday, December 12, 2018 - 1:32:04 PM
Long-term archiving on : Tuesday, December 27, 2016 - 1:07:21 PM


Publisher files allowed on an open archive


  • HAL Id : hal-01371689, version 1



Adrien Barbaresi. Bootstrapped OCR error detection for a less-resourced language variant. 13th Conference on Natural Language Processing (KONVENS 2016), Sep 2016, Bochum, Germany. pp.21-26. ⟨hal-01371689⟩


