Text Content Based Layout Analysis

José Ramón Prieto; Vicente Bosch; Enrique Vidal; Dominique Stutzmann; Sébastien Hamel

doi:10.1109/ICFHR2020.2020.0005

Conference paper, Year : 2020

José Ramón Prieto

Role : Author
PersonId : 1076875

Universitat Politècnica de València = Universidad Politécnica de Valencia = Polytechnic University of Valencia

Vicente Bosch

Role : Author
PersonId : 1076876

Universitat Politècnica de València = Universidad Politécnica de Valencia = Polytechnic University of Valencia

Enrique Vidal

Role : Author

Universitat Politècnica de València = Universidad Politécnica de Valencia = Polytechnic University of Valencia

Dominique Stutzmann

Role : Author
PersonId : 19130
IdHAL : dominique-stutzmann
ORCID : 0000-0003-3705-5825
IdRef : 08848565X

Institut de recherche et d'histoire des textes

Sébastien Hamel

Role : Author
PersonId : 743990
IdHAL : sebastien-hamel
ORCID : 0000-0003-1744-2537
IdRef : 090429796

Abstract

State-of-the-art Document Layout Analysis methods rely on graphical appearance features in order to detect and classify the different layout regions present in a scanned text image. In many cases, however, performing this task using only graphical information is problematic or impossible. Only by actually reading some text in the boundaries of the problematic regions it becomes possible to reliably detect and separate these regions. In these situations, textual, content-based features would be required, but since transcription is usually performed after layout analysis, a vicious circle arises. In this work, we circumvent this deadlock by making use of the recently introduced concept of Probabilistic Index Map. We use the word relevance probabilities provided by this map to calculate relevant text content based features at the pixel level. We assess the impact of these new features on a historical document complex paragraph classification task. The experiments are performed using both a classical Hidden Markov Model approach and Deep Neural Networks. The obtained results are encouraging and showcase the positive impact text content based features will have on the Document Layout Analysis research field.

Keywords

Document Layout Analysis; Text Content Based Features; Hidden Markov Models; Deep Neural Networks

Domains

Artificial Intelligence [cs.AI]; Signal and Image Processing [eess.SP]; History

Contributor : Dominique Stutzmann

https://hal.science/hal-02935071

Submitted on : Thursday, September 10, 2020, 08:24:23

Last modified on : Tuesday, September 5, 2023, 14:28:53

Dates and versions

hal-02935071 , version 1 (10-09-2020)

Identifiers

HAL Id : hal-02935071 , version 1
DOI : 10.1109/ICFHR2020.2020.0005

Cite

José Ramón Prieto, Vicente Bosch, Enrique Vidal, Dominique Stutzmann, Sébastien Hamel. Text Content Based Layout Analysis. 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Sep 2020, Dortmund, Germany. pp.258-263, ⟨10.1109/ICFHR2020.2020.0005⟩. ⟨hal-02935071⟩

Collections

CNRS IRHT CAMPUS-CONDORCET SHMESP ANR
