A probabilistic approach to the recognition of tables of contents (original title: Une approche probabiliste pour la reconnaissance des sommaires)

Abstract : Document analysis and recognition consists in translating document images into a reusable electronic form. Analysis extracts the layout structure of a document from its image, and recognition assigns to the components of that layout structure their logical functions in the document. In this article, we present our work on the recognition of a category of documents whose logical structure is conveyed by typographical tagging, such as tables of contents. We propose a perceptual approach that extracts this typographical tagging directly from document images. However, the structures of such documents are complex and variable. Their complexity can cause errors in the analysis output, which directly affect the recognition task, while their variability requires defining a generic form of logical structure and of the related recognition tasks. Our goal is to address the document structure recognition problem despite these difficulties. We developed an automatic recognition system based on a hybrid model combining a Bayesian classifier and a probabilistic automaton. The classifier is responsible for mapping text blocks extracted from document images to basic logical entities, while the automaton groups these entities into a hierarchical logical structure. This hybrid model is built by semi-supervised learning based on knowledge provided by the user on the one hand and on the typographical properties of the documents on the other. The system has been tested on the automatic indexing of tables of contents in periodicals and journals. The complexity and variability of these documents allow us to demonstrate the effectiveness of the approach.
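
To make the hybrid model described in the abstract more concrete, here is a minimal sketch of the general idea. It is not the authors' implementation. It assumes a naive Bayes likelihood over two hypothetical typographical features (`style`, `align`), three hypothetical logical labels (`title`, `author`, `page`), and hand-written probabilities where the paper uses semi-supervised learning. The classifier and the automaton are combined here with Viterbi decoding, which is one common way to do it and may differ from the method in the article.

```python
import math

# Hypothetical logical entities for table-of-contents blocks.
LABELS = ["title", "author", "page"]

# Hypothetical naive Bayes likelihoods P(feature value | label).
LIKELIHOOD = {
    "title":  {"style": {"bold": 0.7, "italic": 0.2, "regular": 0.1},
               "align": {"left": 0.9, "right": 0.1}},
    "author": {"style": {"bold": 0.1, "italic": 0.7, "regular": 0.2},
               "align": {"left": 0.8, "right": 0.2}},
    "page":   {"style": {"bold": 0.1, "italic": 0.1, "regular": 0.8},
               "align": {"left": 0.1, "right": 0.9}},
}

# Hypothetical probabilistic automaton: start and transition probabilities.
START = {"title": 0.8, "author": 0.1, "page": 0.1}
TRANS = {
    "title":  {"title": 0.1, "author": 0.6, "page": 0.3},
    "author": {"title": 0.1, "author": 0.2, "page": 0.7},
    "page":   {"title": 0.8, "author": 0.1, "page": 0.1},
}


def log_likelihood(block, label):
    """Naive Bayes log P(block features | label), features assumed independent."""
    return sum(math.log(LIKELIHOOD[label][f][v]) for f, v in block.items())


def decode(blocks):
    """Most probable label sequence under classifier + automaton (Viterbi)."""
    scores = [{l: math.log(START[l]) + log_likelihood(blocks[0], l)
               for l in LABELS}]
    back = []
    for block in blocks[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for l in LABELS:
            # Best predecessor state for label l.
            best = max(LABELS, key=lambda p: prev[p] + math.log(TRANS[p][l]))
            cur[l] = (prev[best] + math.log(TRANS[best][l])
                      + log_likelihood(block, l))
            ptr[l] = best
        scores.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(LABELS, key=lambda l: scores[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def group_entries(labels):
    """Group labelled blocks into entries; each 'title' opens a new entry."""
    entries = []
    for index, label in enumerate(labels):
        if label == "title" or not entries:
            entries.append([])
        entries[-1].append((index, label))
    return entries
```

Calling `decode` on a list of feature dictionaries such as `{"style": "bold", "align": "left"}` returns one label per block, and `group_entries` turns that label sequence into a two-level structure (entries containing entities). A real system would estimate the probabilities from annotated or partially annotated pages instead of fixing them by hand.
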
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-01592831
Contributor : Équipe Gestionnaire Des Publications Si Liris
Submitted on : Monday, September 25, 2017 - 2:38:55 PM
Last modification on : Thursday, November 1, 2018 - 1:20:03 AM

Identifiers

  • HAL Id : hal-01592831, version 1

Citation

Souad Souafi Bensafi, Hubert Emptoz, Frank Le Bourgeois, Marc Parizeau. Une approche probabiliste pour la reconnaissance des sommaires. Traitement du Signal, Lavoisier, 2005, 3, 22, pp.191-208. ⟨hal-01592831⟩
