Conference papers

OCR performance prediction using cross-OCR alignment

Abstract : Since 2006, the National Library of France (BnF) has carried out many mass digitization projects on its collections. Digital documents on Gallica (the BnF's digital library) are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. Modern OCR technologies perform well on modern documents produced with a uniform layout and known fonts. For old documents, however, OCR results are of lower quality. OCR quality assessment is a real challenge for the BnF. On the one hand, because OCR processing is organized as a sequential pipeline, identifying the sources of OCR errors is intractable. On the other hand, OCR outputs report no quality information beyond word confidence. In this paper, we present a study on OCR performance estimation that aims to control the quality of the word transcriptions produced by OCR. This quality assessment process has to operate without any comparison with ground-truth data. To this end, our methodology relies on cross-alignment of the OCR results with those of a secondary OCR, called the reference OCR. This secondary OCR provides uncertain but useful information that is used as an uncertain ground truth. OCR performance is estimated using support vector regression. This predictor uses global features computed on the cross-alignment results. The reported experiments show that our estimate describes the quality of OCR outputs more faithfully than the average word confidence scores computed by the OCR. The proposed methodology can easily be adapted to various corpora by tuning the system on a training dataset of documents whose properties are similar to those of the documents to be processed.
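
The cross-alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the alignment algorithm or the exact global features, so everything below is an assumption: the function name `cross_alignment_features`, the four feature names, and the use of Python's standard `difflib.SequenceMatcher` as a stand-in word aligner are all hypothetical choices made for illustration, not the authors' method.

```python
from difflib import SequenceMatcher


def cross_alignment_features(primary_words, reference_words):
    """Align two OCR word sequences and return global agreement features.

    Hypothetical helper: the feature set is an illustrative assumption,
    not the one used in the paper.
    """
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise ignore frequent words on long documents.
    matcher = SequenceMatcher(a=primary_words, b=reference_words, autojunk=False)
    counts = {"equal": 0, "replace": 0, "delete": 0, "insert": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Count words on the primary side, except for insertions,
        # which exist only on the reference side.
        counts[tag] += (j2 - j1) if tag == "insert" else (i2 - i1)
    n_primary = max(len(primary_words), 1)      # avoid division by zero
    n_reference = max(len(reference_words), 1)
    return {
        "agreement_rate": counts["equal"] / n_primary,
        "substitution_rate": counts["replace"] / n_primary,
        "unmatched_primary_rate": counts["delete"] / n_primary,
        "unmatched_reference_rate": counts["insert"] / n_reference,
    }


# Toy example: each OCR output misreads one different word.
primary = "the qnick brown fox jumps over the lazy dog".split()
reference = "the quick brown fox jumps ovcr the lazy dog".split()
features = cross_alignment_features(primary, reference)
print(features)
```

In a pipeline of the kind the abstract outlines, a feature vector like this one would be computed per document and passed to a regressor (the abstract names support vector regression) trained on documents whose true OCR accuracy is known. That training step is omitted here because it depends on data and settings the abstract does not give.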
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01191701
Contributor : Nicolas Ragot
Submitted on : Wednesday, September 2, 2015 - 1:33:57 PM
Last modification on : Tuesday, January 11, 2022 - 5:56:22 PM

Identifiers

  • HAL Id : hal-01191701, version 1

Citation

Ahmed Ben Salah, Jean-Philippe Moreux, Nicolas Ragot, Thierry Paquet. OCR performance prediction using cross-OCR alignment. 13th International Conference on Document Analysis and Recognition (ICDAR 2015), Aug 2015, Nancy, France. ⟨hal-01191701⟩
