Book chapter. Year: 2020

Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition

Abstract

The accessibility of digitized documents in digital libraries is greatly affected by the quality of document indexing. Among the most relevant information to index, named entities are one of the main entry points used to search and retrieve digital documents. However, most digitized documents are indexed through their OCRed version, and OCR errors hinder their accessibility. This paper aims to quantitatively estimate the impact of OCR quality on the performance of named entity recognition (NER). We tested state-of-the-art NER techniques over several evaluation benchmarks and experimented with various levels and types of OCR noise to estimate its impact on NER performance. To the best of our knowledge, no other research work has systematically studied the impact of OCR on named entity recognition over data sets in multiple languages. The final outcome of this study is an evaluation over historical newspaper data provided by the National Library of Finland, resulting in a large increase over the best-known results to date.
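
As a rough illustration of what "levels of OCR noise" can mean in practice, the sketch below degrades a clean string with random character substitutions and measures the resulting character error rate (CER) as edit distance divided by reference length. This is not the authors' pipeline: the function names (`levenshtein`, `character_error_rate`, `add_substitution_noise`) are hypothetical, and uniform random substitution is a deliberate simplification of real OCR degradation, which the abstract says was studied in several types.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """CER = edit distance / length of the clean reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, ocr_output) / len(reference)


def add_substitution_noise(text: str, rate: float, seed: int = 0) -> str:
    """Assumed noise model: each letter is replaced by a random lowercase
    letter with probability `rate`. A replacement can coincide with the
    original letter, so the measured CER may be below `rate`."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(alphabet) if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )


if __name__ == "__main__":
    clean = "The National Library of Finland digitized historical newspapers."
    for rate in (0.0, 0.05, 0.2):
        noisy = add_substitution_noise(clean, rate, seed=42)
        print(f"rate={rate:.2f} CER={character_error_rate(clean, noisy):.3f} {noisy}")
```

A single misread character, as in `character_error_rate("Helsinki", "He1sinki")`, gives a CER of 0.125. Running an NER tagger on the clean and the degraded text and comparing the entity-level scores at each noise level would then give the kind of degradation curve the abstract describes.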
Main file
TPDL_2020_paper_45.pdf (588.81 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03026931, version 1 (26-11-2020)

Identifiers

HAL Id: hal-03026931
DOI: 10.1007/978-3-030-54956-5_7

Cite

Ahmed Hamdi, Axel Jean-Caurant, Nicolas Sidère, Mickaël Coustaty, Antoine Doucet. Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition. Digital Libraries for Open Knowledge: 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, Lyon, France, August 25–27, 2020, Proceedings, pp. 87-101, 2020, ⟨10.1007/978-3-030-54956-5_7⟩. ⟨hal-03026931⟩

Collections

L3I UNIV-ROCHELLE