A search engine for Arabic documents - HAL open archive
Conference paper, Year: 2008

A search engine for Arabic documents

Abstract

This paper attempts to index and search degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. The test was performed on some Arabic historical documents and showed good performance.
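
The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that each connected component has already been reduced to a short string of feature codes. The code letters used here (`A` for ascender, `L` for loop, `D` for descender, `P` for dot), the function names, and the `max_dist` threshold are hypothetical choices for illustration; the abstract does not specify the actual coding scheme or the search procedure beyond its use of Levenshtein distance.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def search(query: str, coded_words: list[str], max_dist: int) -> list[tuple[int, str, int]]:
    """Return (position, code string, distance) for entries within max_dist of
    the query, closest first. The threshold is an assumed tuning parameter."""
    hits = [(i, w, levenshtein(query, w)) for i, w in enumerate(coded_words)]
    return sorted((h for h in hits if h[2] <= max_dist), key=lambda h: h[2])


# Hypothetical feature-code strings standing in for one indexed document.
document_codes = ["ALD", "LLA", "ALLD", "DPA"]
matches = search("ALD", document_codes, max_dist=1)
```

Allowing a small non-zero distance is what lets a query still match an entry in which one feature was missed or spuriously detected, which is the situation the abstract describes for handwriting and degraded prints.
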
Main file
paper-21.pdf (3.14 MB)
Origin: Explicit agreement for this deposit

Dates and versions

hal-00334402, version 1 (26-10-2008)

Identifiers

  • HAL Id: hal-00334402, version 1

Cite

T. Sari, A. Kefali. A search engine for Arabic documents. Colloque International Francophone sur l'Ecrit et le Document, Oct 2008, France. pp.97-102. ⟨hal-00334402⟩

Collections

CIFED08
230 views
774 downloads
