Conference paper, Year: 2016

Text extraction in document images: highlight on using corner points

Vikas Yadav
  • Role: Author
  • PersonId: 976348

Abstract

In recent years, text extraction in document images has been widely studied in the general context of Document Image Analysis (DIA), especially in the framework of layout analysis. Many existing techniques rely on complex processes based on preprocessing, image transforms, or component/edge extraction and analysis. At the same time, text extraction in videos has received increased interest, and the use of corner or key points has proven very effective there. Because very few studies have addressed the use of corner points for text extraction in document images, we propose in this paper to evaluate the possibilities of this kind of approach for DIA. To do so, we designed a very simple technique based on FAST key points. A first stage divides the image into blocks and computes the density of points inside each one. The densest blocks are kept as text blocks. Then, the connectivity of blocks is checked to group them and obtain complete text blocks. This technique has been evaluated on different kinds of images: different languages (Telugu, Arabic, French), handwritten as well as typewritten documents, skewed documents, images at different resolutions and with different kinds and amounts of noise (deformations, ink dots, bleed-through, acquisition issues such as blur and resolution), etc. Even with fixed parameters for all these kinds of document images, precision and recall are close to or higher than 90%, which makes this basic method already effective. Consequently, even if the proposed approach is not a theoretical breakthrough, it highlights that accurate text extraction could be achieved without a complex approach. Moreover, this approach could easily be improved to be more precise, robust, and useful for more complex layout analysis.
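The two stages described in the abstract (block-wise key point density, then grouping of connected dense blocks) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the block size of 32 pixels, the `min_points` threshold, and the use of 8-connectivity are all assumptions, since the abstract does not give these details. The key points are taken as plain `(x, y)` pairs, which in practice might come from a FAST detector such as OpenCV's `cv2.FastFeatureDetector_create().detect(...)`.

```python
from collections import deque


def block_densities(points, width, height, block=32):
    """Count key points in each cell of a grid of block x block pixels."""
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:
            grid[int(y) // block][int(x) // block] += 1
    return grid


def group_text_blocks(grid, min_points=4):
    """Keep cells with at least min_points key points (assumed threshold)
    and merge 8-connected dense cells into bounding boxes.

    Returns boxes as (row_min, col_min, row_max, col_max) in grid units.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] < min_points or (r, c) in seen:
                continue
            # Breadth-first search over neighbouring dense cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                r0, c0 = min(r0, cr), min(c0, cc)
                r1, c1 = max(r1, cr), max(c1, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and (nr, nc) not in seen
                                and grid[nr][nc] >= min_points):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
            boxes.append((r0, c0, r1, c1))
    return boxes
```

Multiplying the returned grid coordinates by the block size gives approximate pixel bounding boxes for the candidate text regions. A fixed count threshold is used here for simplicity; a threshold relative to the mean density of the page would be another plausible reading of "the densest blocks".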
No file deposited

Dates and versions

hal-01269802, version 1 (05-02-2016)

Identifiers

  • HAL Id: hal-01269802, version 1

Cite

Vikas Yadav, Nicolas Ragot. Text extraction in document images: highlight on using corner points. International Workshop on Document Analysis Systems (DAS), Apr 2016, Santorini, Greece. ⟨hal-01269802⟩