Book chapter. Year: 2013

Des dictionnaires éditoriaux aux représentations XML standardisées (From Publishing Dictionaries to Standardized XML Representations)

Abstract

Creating an electronic dictionary from scratch is expensive because the task mobilizes, over a long period, the work of contributors skilled in linguistics, if not in lexicology. Specialized computer tools are essential when the resources are to be used by natural language processing programs. When the socio-economic environment cannot provide the resources needed to draft an electronic dictionary but printed dictionaries exist, these dictionaries are an important resource that can be used to bootstrap the creation of electronic lexical resources. This paper presents theoretical and practical aspects of converting publishing dictionaries into electronic lexical resources. It takes into account limited economic resources, limited technology, and the limited availability of qualified people. Our field experiments concern under-resourced languages, mainly in Southeast Asia (Khmer, Malay, Vietnamese) and the Sahel (Bambara, Hausa, Kanuri, Tamajaq, Zarma), so most of the examples and sociolinguistic situations described in the paper relate to these areas. After a brief history of the formats and technologies used for electronic dictionaries (SGML, XML, XSLT and CSS), we present two standards dedicated to them: the Text Encoding Initiative (TEI) and the Lexical Markup Framework (LMF). The issue of under-resourced languages is then set out, followed by some examples of published dictionaries. The main technical challenges are detailed, such as the lack of standardization of the alphabets used and of special characters (outside the traditional Latin range). The conversion methodology is outlined and then detailed. Conversion to an XML bridge format can be done with regular expressions or with specialized tools. The bridge format is then converted into the target LMF format. The last part is dedicated to consulting the resources through an online resource management platform.
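The regular-expression route to a bridge format mentioned in the abstract could look roughly like the sketch below. It is only an illustration under stated assumptions: the line layout (`headword [pronunciation] pos. gloss`), the element names (`dictionary`, `entry`, `headword`, `pron`, `pos`, `gloss`), the helper names and the sample entries are all invented here and are not taken from the chapter. The subsequent conversion from the bridge format to LMF is not shown.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical line layout (an assumption, not the chapter's format):
#   headword [pronunciation] pos. gloss
# \S+ matches any non-whitespace characters, including letters outside
# the basic Latin range, so special characters in headwords are kept.
ENTRY_RE = re.compile(
    r"^(?P<headword>\S+)\s+"
    r"\[(?P<pron>[^\]]+)\]\s+"
    r"(?P<pos>[^\s.]+)\.\s+"
    r"(?P<gloss>.+)$"
)

def line_to_entry(line):
    """Return an <entry> element for one line, or None if it does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    entry = ET.Element("entry")
    for tag in ("headword", "pron", "pos", "gloss"):
        ET.SubElement(entry, tag).text = m.group(tag)
    return entry

def convert(lines):
    """Build the bridge XML tree and collect lines that need manual review."""
    root = ET.Element("dictionary")
    rejected = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = line_to_entry(line)
        if entry is None:
            rejected.append(line)
        else:
            root.append(entry)
    return root, rejected

# Invented placeholder entries, not real lexical data.
SAMPLE = [
    "lemma1 [lemma1] n. invented gloss one",
    "lemma2 [lemma2] v. invented gloss two",
    "this line does not follow the layout",
]

root, rejected = convert(SAMPLE)
xml_text = ET.tostring(root, encoding="unicode")
```

Keeping the non-matching lines in `rejected` instead of discarding them reflects a common practice when converting printed dictionaries: irregular entries are set aside for manual correction rather than silently lost.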

Main file
Livre-Nuria_MM-CE_V16.pdf (667.26 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00959229, version 1 (01-04-2014)
hal-00959229, version 2 (28-09-2019)

Cite

Mathieu Mangeot, Chantal Enguehard. Des dictionnaires éditoriaux aux représentations XML standardisées. Gala, Nuria and Zock, Michael. Ressources Lexicales : contenu, construction, utilisation, évaluation, John Benjamins, pp.24, 2013, ⟨10.1075/lis.30.08man⟩. ⟨hal-00959229v2⟩