Dictionaries for Under-Resourced Languages: from Published Files to Standardized Resources Available on the Web

Abstract : Most work in the field of natural language processing focuses on well-resourced languages. However, much remains to be done on under-resourced ones: there are few dictionaries, parsers, etc. Nevertheless, when published dictionaries are available, it is sometimes possible to find the data files used to print the dictionary (usually in Word format). A conversion process can then be applied to these files in order to obtain standardized XML lexical data. Attention must be paid to specific problems such as a lack of standardization in the alphabets or the use of hacked fonts for displaying specific characters. Next, the standardized XML data can be imported into an online lexical resources management platform. It is then available online for lookup and editing. A final step can also be performed to automatically export the data into interchange formats such as Lexical Markup Framework or lemon in order to produce linked data.
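
The hacked-font problem mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' conversion tool: the mapping table, the function names (`normalize_text`, `make_entry`) and the XML element names (`entry`, `headword`, `pos`, `definition`) are all assumptions chosen for illustration. The idea is that a hacked font draws a special glyph in the slot of an ordinary character, so the stored code points must be remapped to the intended Unicode characters before the entry is written out as XML.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical mapping for an imaginary hacked font: the stored character
# (left) was displayed as the intended character (right). A real table must
# be built by inspecting the actual font used to print the dictionary.
HACKED_FONT_MAP = {
    "\u00f1": "\u014b",  # slot of 'ñ' assumed to have been drawn as 'ŋ'
    "\u00e7": "\u0254",  # slot of 'ç' assumed to have been drawn as 'ɔ'
}


def normalize_text(text, mapping=HACKED_FONT_MAP):
    """Remap hacked-font characters, then apply Unicode NFC normalization."""
    remapped = "".join(mapping.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", remapped)


def make_entry(headword, pos, definition):
    """Build a simple XML entry; element names are illustrative only."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = normalize_text(headword)
    ET.SubElement(entry, "pos").text = pos
    ET.SubElement(entry, "definition").text = definition
    return entry


if __name__ == "__main__":
    entry = make_entry("\u00f1\u00e7ma", "n.", "example gloss")
    print(ET.tostring(entry, encoding="unicode"))
```

Only the headword is remapped here, on the assumption that glosses were typed in a standard font; in practice the font used for each text run would need to be checked before any remapping is applied.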
Document type :
Reports

https://hal.archives-ouvertes.fr/hal-02056905
Contributor : Mathieu Mangeot
Submitted on : Monday, March 4, 2019 - 9:39:39 PM
Last modification on : Tuesday, April 2, 2019 - 1:47:28 AM

File

LRE_MANGEOT-ENGUEHARD_v14.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02056905, version 1

Citation

Mathieu Mangeot, Chantal Enguehard. Dictionaries for Under-Resourced Languages: from Published Files to Standardized Resources Available on the Web. [Research Report] Laboratoire d'informatique de Grenoble. 2018. ⟨hal-02056905⟩
