Text and Non-text Segmentation based on Connected Component Features

Viet Phuong Le; Nibal Nayef; Muriel Visani; Jean-Marc Ogier; De Cao Tran

Conference paper, Year: 2015

Viet Phuong Le

Role: Author
PersonId: 974158

Laboratoire Informatique, Image et Interaction - EA 2118

Nibal Nayef

Role: Author
PersonId: 8626
IdHAL: nibal-nayef
IdRef: 195053486

Laboratoire Informatique, Image et Interaction - EA 2118

Muriel Visani

Role: Author
PersonId: 864965

Laboratoire Informatique, Image et Interaction - EA 2118

Jean-Marc Ogier

Role: Author
PersonId: 833747

Laboratoire Informatique, Image et Interaction - EA 2118

De Cao Tran

Role: Author
PersonId: 872602

Can Tho University [Vietnam]

Abstract

Document image segmentation is crucial to OCR and other digitization processes. In this paper, we present a learning-based approach for separating text from non-text in document images. The training features are extracted at the level of connected components, an intermediate level between the slow, noise-sensitive pixel level and the segmentation-dependent zone level. To handle connected components of all types, shapes and sizes, we extract a set of features based on the size, shape, stroke width and position of each connected component. AdaBoost with decision trees is used to label the connected components. Finally, the classification of connected components into text and non-text is corrected using the classification probabilities together with a size and stroke-width analysis of each connected component's nearest neighbors. The approach is evaluated on two standard datasets: UW-III and the ICDAR-2009 document layout analysis competition dataset. Our results demonstrate that the proposed approach achieves competitive performance for segmenting text and non-text in document images of variable content and degradation.
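
To make the idea of connected-component features more concrete, here is a minimal Python sketch of the first stage only: labeling connected components in a binary image and computing a few size, shape, position and stroke-width descriptors for each. It is not the authors' implementation. The function names, the exact feature set, the choice of 8-connectivity and the run-length proxy for stroke width are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected foreground (value 1) components of a binary
    image given as a list of rows; each component is a list of (y, x)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps


def component_features(pixels, img_h, img_w):
    """Hypothetical feature vector for one component (illustrative only)."""
    pset = set(pixels)
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    area = len(pixels)
    # A horizontal run starts at every pixel with no left neighbor in the
    # component; the mean run length is a crude stand-in for stroke width.
    runs = sum(1 for (y, x) in pset if (y, x - 1) not in pset)
    return {
        "width": width,                        # size
        "height": height,                      # size
        "area": area,                          # size
        "aspect_ratio": width / height,        # shape
        "density": area / (width * height),    # shape: bounding-box fill
        "mean_run": area / runs,               # assumed stroke-width proxy
        "center_x": (min(xs) + max(xs)) / 2 / img_w,  # position
        "center_y": (min(ys) + max(ys)) / 2 / img_h,  # position
    }


# Toy image: a 2x2 blob and a 3-pixel vertical stroke.
img = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
for pixels in connected_components(img):
    print(component_features(pixels, len(img), len(img[0])))
```

In a pipeline like the one described, vectors of this kind would be passed to a boosted decision-tree classifier, and the predicted labels would then be revised using neighboring components; those two stages are omitted from this sketch.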

Domains

Computer Vision and Pattern Recognition [cs.CV]; Image Processing [eess.IV]; Text and Document Processing

Contributor: Nibal Nayef

https://hal.science/hal-01319903

Submitted on: Monday, 23 May 2016, 11:04:48

Last modified on: Thursday, 12 May 2022, 15:37:47

Dates and versions

hal-01319903, version 1 (23-05-2016)

Identifiers

HAL Id: hal-01319903, version 1

Cite

Viet Phuong Le, Nibal Nayef, Muriel Visani, Jean-Marc Ogier, De Cao Tran. Text and Non-text Segmentation based on Connected Component Features. International Conference on Document Analysis and Recognition (ICDAR 2015), Aug 2015, Nancy, France. pp. 1096-1100. ⟨hal-01319903⟩

Collections

L3I UNIV-ROCHELLE
