Conference paper, Year: 2021

The BIR database – Identifying typographic emphasis in list-like historical documents

Abstract

Layout analysis and optical character recognition have become traditional tasks for processing historical prints, but are now insufficient. Additional information is found in typographic emphasis, such as bold and italic letters. They carry semantic meaning (titles, emphasis...) and also outline the structure of the page (entries, sub-parts...). Retrieving such data is therefore crucial for information extraction and automatic document structuring. In this paper, we introduce the Bold-Italic-Regular (BIR) database, which contains 285 pages of scanned, list-like historical prints that have been annotated at word level with bold and italic emphasis. Baseline results are provided for word detection and style classification using state-of-the-art deep neural network models, highlighting promising possibilities, such as near-human performance for isolated word classification, but also demonstrating limitations for the task at hand.
Main file
pg37_hip21-22_Scius-Bertrand.pdf (3.63 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03355683, version 1 (27-09-2021)

License

Attribution - NonCommercial

Identifiers

Cite

Anna Scius Bertrand, Simon Gabay, Ljudmila Petkovic, Juliette Janes, Caroline Corbières, et al. The BIR database – Identifying typographic emphasis in list-like historical documents. HIP@ICDAR21 - The 6th International Workshop on Historical Document Imaging and Processing, Sep 2021, Lausanne, Switzerland. ⟨10.1145/3476887.3476913⟩. ⟨hal-03355683⟩
127 views
63 downloads
