Logical Structure Extraction from Digitized Books - HAL open archive
Book chapter, Year: 2018

Logical Structure Extraction from Digitized Books

Abstract

Mass digitization projects, such as the Million Book Project, the efforts of the Open Content Alliance, and the digitization work of Google, are converting whole libraries by digitizing books on an industrial scale [5]. The process involves efficiently photographing books, page by page, and converting the image of each page into searchable text with optical character recognition (OCR) software. Current digitization and OCR technologies typically produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR output, but more sophisticated structures, such as chapters and sections, are not recognized. To enable systems to provide users with richer browsing experiences, such additional structures must be made available, for example, in the form of XML markup embedded in the full text of the digitized books. The Book Structure Extraction competition aims to address this need by promoting research into automatic structure recognition and extraction techniques that could complement or enhance current OCR methods.
Main file
9789813229273_0001(2).pdf (852.53 KB)
Origin: Explicit agreement for this deposit

Dates and versions

hal-03025598, version 1 (15-12-2020)

Cite

Antoine Doucet. Logical Structure Extraction from Digitized Books. Document Analysis and Text Recognition: Benchmarking State-of-the-Art Systems, pp.3-28, 2018, ⟨10.1142/9789813229273_0001⟩. ⟨hal-03025598⟩

Collections

L3I UNIV-ROCHELLE