A search engine for Arabic documents

T. Sari; A. Kefali

Conference paper, Year: 2008

T. Sari

Role: Author
PersonId : 171527
IdHAL : tewfiksari
ORCID : 0000-0002-6274-7826
IdRef : 095778101

Laboratoire de Gestion Electronique de Document [Annaba]

A. Kefali

Role: Author

Laboratoire de Gestion Electronique de Document [Annaba]

Abstract

This paper presents an approach to indexing and searching degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.
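
As an illustration of the matching step described in the abstract, the following is a minimal sketch of approximate search over feature-code strings using the Levenshtein distance. It is not the authors' implementation. The feature alphabet (L for loop, A for ascender, D for descender), the one-code-string-per-component layout, the function names levenshtein and search, and the max_dist threshold are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        prev = curr
    return prev[-1]


def search(query_code: str, document_codes: list[str], max_dist: int) -> list[tuple[int, str, int]]:
    """Return (position, code, distance) for every code within max_dist of the query.

    Results are ordered by increasing distance; ties keep document order.
    """
    hits = []
    for position, code in enumerate(document_codes):
        distance = levenshtein(query_code, code)
        if distance <= max_dist:
            hits.append((position, code, distance))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Hypothetical codes: L = loop, A = ascender, D = descender.
    codes = ["LAD", "LD", "AAA", "LADD"]
    for position, code, distance in search("LAD", codes, max_dist=1):
        print(position, code, distance)
```

Allowing a small non-zero distance lets a query still match when a feature was missed or spuriously detected in a degraded image, which is the motivation the abstract gives for approximate rather than exact matching.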

Keywords

handwriting segmentation; document retrieval; Arabic handwriting recognition

Domains

Document and text processing; Computer vision and pattern recognition [cs.CV]; Signal and image processing [eess.SP]

Main file

paper-21.pdf (3.14 MB)

Origin: Explicit agreement for this deposit

Contributor: Sébastien Adam

https://hal.science/hal-00334402

Submitted on: Sunday, 26 October 2008, 01:33:44

Last modified on: Monday, 12 February 2024, 12:04:05

Long-term archiving on: Monday, 7 June 2010, 21:54:11

Dates and versions

hal-00334402 , version 1 (26-10-2008)

Identifiers

HAL Id : hal-00334402 , version 1

Cite

T. Sari, A. Kefali. A search engine for Arabic documents. Colloque International Francophone sur l'Ecrit et le Document, Oct 2008, France. pp.97-102. ⟨hal-00334402⟩

Collections

CIFED08
