Conference paper, Year: 2019

Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling

Alice Millour
Karën Fort

Abstract

Non-standardized languages are a challenge to the construction of representative linguistic resources and to the development of efficient natural language processing tools: when spelling is not determined by a consensual norm, a multiplicity of alternative written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings from which variation rules are automatically extracted. The rules are then used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring manual rule definition by experts. We apply this multilingual methodology to Alsatian, a French regional language, and provide (i) an intrinsic evaluation of the correctness of the obtained variant pairs, and (ii) an extrinsic evaluation on a downstream task: part-of-speech tagging. We show that in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs, and (ii) a reduction in out-of-vocabulary words that improves tagging performance by 1 to 4%.
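The two steps named in the abstract (extracting variation rules from collected variant pairs, then using them to link an out-of-vocabulary word to a known form) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of `difflib` to align the two spellings, the single-substitution rule format, and the toy strings are all assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect character-level substitution rules from spelling-variant pairs.

    Assumption: a "rule" is a (source, target) substring substitution,
    treated as symmetric. The paper may define rules differently.
    """
    rules = set()
    for left, right in pairs:
        matcher = SequenceMatcher(None, left, right, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                a, b = left[i1:i2], right[j1:j2]
                rules.add((a, b))
                rules.add((b, a))
    return rules


def apply_rules(word, rules):
    """Generate candidate spellings by applying one rule at one position."""
    candidates = set()
    for src, dst in rules:
        if not src:
            continue  # skip pure insertions to keep generation bounded
        start = word.find(src)
        while start != -1:
            candidates.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    candidates.discard(word)
    return candidates


def match_oov(oov_word, lexicon, rules):
    """Return known lexicon entries reachable from an OOV word via one rule."""
    return sorted(apply_rules(oov_word, rules) & set(lexicon))


# Toy strings invented for this example (not data from the paper).
toy_pairs = [("kalu", "kolu")]
toy_lexicon = {"molu", "tiri"}
toy_rules = extract_rules(toy_pairs)
print(match_oov("malu", toy_lexicon, toy_rules))
```

In this sketch a single collected pair yields one symmetric rule, which is enough to link the unseen form `malu` to the lexicon entry `molu`. A realistic system would presumably also need to filter or rank rules to limit false matches.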
Main file
Proceedings_of_Recent_Advances_in_Natural_Language_Processing.pdf (1.09 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02280002, version 1 (05-09-2019)

Identifiers

  • HAL Id: hal-02280002, version 1

Cite

Alice Millour, Karën Fort. Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling. RANLP, Sep 2019, Varna, Bulgaria. pp. 776-784. ⟨hal-02280002⟩
120 views
88 downloads
