Conference papers

Text Content Based Layout Analysis

Abstract : State-of-the-art Document Layout Analysis methods rely on graphical appearance features in order to detect and classify the different layout regions present in a scanned text image. In many cases, however, performing this task using only graphical information is problematic or impossible. Only by actually reading some text in the boundaries of the problematic regions it becomes possible to reliably detect and separate these regions. In these situations, textual, content-based features would be required, but since transcription is usually performed after layout analysis, a vicious circle arises. In this work, we circumvent this deadlock by making use of the recently introduced concept of Probabilistic Index Map. We use the word relevance probabilities provided by this map to calculate relevant text content based features at the pixel level. We assess the impact of these new features on a historical document complex paragraph classification task. The experiments are performed using both a classical Hidden Markov Model approach and Deep Neural Networks. The obtained results are encouraging and showcase the positive impact text content based features will have on the Document Layout Analysis research field.
Contributor : Dominique Stutzmann
Submitted on : Thursday, September 10, 2020 - 8:24:23 AM
Last modification on : Monday, May 17, 2021 - 11:44:26 AM




José Ramón Prieto, Vicente Bosch, Enrique Vidal, Dominique Stutzmann, Sébastien Hamel. Text Content Based Layout Analysis. 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Sep 2020, Dortmund, Germany. pp.258-263, ⟨10.1109/ICFHR2020.2020.0005⟩. ⟨hal-02935071⟩


