Conference paper. Year: 2015

Entity Matching in OCRed Documents with Redundant Databases

Abstract

This paper presents an approach to entity recognition in documents processed by OCR (Optical Character Recognition). Recognition is formulated as the task of matching entities in a database with their representations in a document. An entity resolution pre-processing step is first applied to the database to obtain a better representation of the entities; it uses a statistical model based on record linkage and record merge phases. Because OCRed documents can contain noisy data and altered structure, an adapted method is proposed that retrieves the entities from their structures while tolerating possible OCR errors. A modified version of EROCS is applied to this problem by adapting its notion of segments to the blocks provided by the OCR; it uses these document segments to match the document to its corresponding entities. For efficiency, data in the document is labeled so as to filter the entities and segments that are compared. The evaluation on business documents shows a significant improvement in matching rates over EROCS.
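
To make the idea of matching database entities to OCR blocks while tolerating recognition errors more concrete, here is a minimal sketch. It is not EROCS and not the statistical model described in the paper: it simply swaps in a generic string similarity from Python's standard `difflib`. The function names, the 0.8 threshold, the averaging rule and the sample data in the usage note are all assumptions made for illustration.

```python
import string
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (generic stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_window_score(value: str, block: str) -> float:
    """Best similarity between `value` and any run of block tokens
    that has the same number of tokens as `value`."""
    # Strip surrounding punctuation so "Lisb0n," compares as "Lisb0n".
    tokens = [t.strip(string.punctuation) for t in block.split()]
    n = max(1, len(value.split()))
    starts = range(max(1, len(tokens) - n + 1))
    windows = [" ".join(tokens[i:i + n]) for i in starts]
    return max(similarity(value, w) for w in windows)


def match_entity(entities, blocks, threshold=0.8):
    """Return (entity_id, score) for the best-scoring entity, or None.

    `entities` maps an id to a dict of attribute values (hypothetical layout).
    An attribute counts only if its best score over all blocks reaches
    `threshold`; the entity score is the mean over its attributes.
    """
    best = None
    for entity_id, attributes in entities.items():
        scores = []
        for value in attributes.values():
            s = max((best_window_score(value, b) for b in blocks), default=0.0)
            scores.append(s if s >= threshold else 0.0)
        score = sum(scores) / len(scores) if scores else 0.0
        if score > 0.0 and (best is None or score > best[1]):
            best = (entity_id, score)
    return best
```

For example, with a hypothetical entity `{"name": "Acme Industries", "city": "Lisbon"}` and OCR blocks such as `"Invoice from Acrne lndustries"` and `"Lisb0n, Portugal"`, both attributes clear the threshold despite the character confusions, so that entity is returned. A real system would also need the database de-duplication and the filtering by data labeling that the abstract mentions, which this sketch omits.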

Dates and versions

hal-01144641, version 1 (22-04-2015)

Cite

Nihel Kooli, Abdel Belaïd. Entity Matching in OCRed Documents with Redundant Databases. International Conference on Pattern Recognition Applications and Methods (ICPRAM-2015), Jan 2015, Lisbon, Portugal. ⟨10.5220/0005177301650172⟩. ⟨hal-01144641⟩