Old document image segmentation using the autocorrelation function and multiresolution analysis

Abstract: Recent progress in the digitization of heterogeneous collections of ancient documents has raised new challenges in information retrieval in digital libraries and in document layout analysis. To control the quality of historical document image digitization and to meet the need for a characterization of document content using intermediate-level metadata (between image and document structure), we propose a fast automatic layout segmentation of old document images based on five descriptors. These descriptors, based on the autocorrelation function, are obtained by multiresolution analysis and then used in a specific clustering method. The proposed method has the advantage of making no hypothesis about the document structure, whether about the document model (physical structure) or the typographical parameters (logical structure). It is also parameter-free, since it automatically adapts to the image content. In this paper, we first detail our proposal to characterize the content of old documents by extracting autocorrelation features in the different areas of a page and at several resolutions. Then, we show that it is possible to automatically find the homogeneous regions defined by similar autocorrelation indices, without knowing the number of clusters, using adapted hierarchical ascendant classification and consensus clustering approaches. To assess our method, we apply our algorithm to 316 old document images, which encompass six centuries (1200-1900) of French history, in order to demonstrate the performance of our proposal in terms of segmentation and characterization of heterogeneous corpus content. Moreover, we define a new evaluation metric, the homogeneity measure, which evaluates the segmentation and characterization accuracy of our methodology. We obtain a mean homogeneity accuracy of 85%.
These results help to represent a document by a hierarchy of layout structure and content, and to define one or more signatures for each page, on the basis of a hierarchical representation of homogeneous blocks and their topology.
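To make the role of the autocorrelation function concrete, the following is a minimal sketch and not the authors' implementation: it computes the normalized circular autocorrelation of a small grayscale patch in plain Python. The function name `autocorrelation`, the synthetic striped patch, and the circular (wrap-around) boundary handling are all assumptions made for illustration; the abstract does not specify how the five descriptors are derived from the autocorrelation.

```python
def autocorrelation(patch):
    """Normalized circular autocorrelation of a 2D grayscale patch.

    `patch` is a list of equal-length rows of numbers. The result has the
    same shape; entry [dy][dx] is the correlation of the mean-centered
    patch with itself shifted by (dy, dx), wrapping around the borders.
    Hypothetical helper for illustration only.
    """
    h, w = len(patch), len(patch[0])
    mean = sum(map(sum, patch)) / (h * w)
    centered = [[v - mean for v in row] for row in patch]
    energy = sum(v * v for row in centered for v in row)
    if energy == 0:
        # Flat patch: no texture, so the autocorrelation is defined as zero.
        return [[0.0] * w for _ in range(h)]
    result = [[0.0] * w for _ in range(h)]
    for dy in range(h):
        for dx in range(w):
            acc = 0.0
            for y in range(h):
                row = centered[y]
                shifted = centered[(y + dy) % h]
                for x in range(w):
                    acc += row[x] * shifted[(x + dx) % w]
            result[dy][dx] = acc / energy
    return result


# Synthetic 8x8 patch of horizontal stripes with a vertical period of 4 rows,
# a crude stand-in for regularly spaced text lines.
patch = [[255 if (y % 4) < 2 else 0 for _ in range(8)] for y in range(8)]
acf = autocorrelation(patch)
```

On this striped patch the autocorrelation stays at its maximum along horizontal shifts and oscillates with the stripe period along vertical shifts, which illustrates how the function can expose the orientation and periodicity of a texture without any prior model of the page. A practical version would typically use an FFT-based computation on sliding windows of several sizes.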
Document type:
Conference paper
SPIE. Document Recognition and Retrieval XX, Feb 2013, San Francisco, United States. SPIE, 8658 (18), pp.8658-18, 2013, Document Recognition and Retrieval XX, Richard Zanibbi; Bertrand Coüasnon, Editors, 86580K. 〈10.1117/12.2002365〉

https://hal.archives-ouvertes.fr/hal-00787779
Contributor: Maroua Mehri
Submitted on: Wednesday, 13 February 2013 - 19:00:03
Last modified on: Wednesday, 11 October 2017 - 11:18:01
Document(s) archived on: Tuesday, 14 May 2013 - 04:03:58

File

MarouaMEHRI_DRR2013.pdf
Files produced by the author(s)


Citation

Maroua Mehri, Petra Gomez-Krämer, Pierre Héroux, Rémy Mullot. Old document image segmentation using the autocorrelation function and multiresolution analysis. SPIE. Document Recognition and Retrieval XX, Feb 2013, San Francisco, United States. SPIE, 8658 (18), pp.8658-18, 2013, Document Recognition and Retrieval XX, Richard Zanibbi; Bertrand Coüasnon, Editors, 86580K. 〈10.1117/12.2002365〉. 〈hal-00787779〉
