Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling

Abstract : Non-standardized languages are a challenge for the construction of representative linguistic resources and for the development of efficient natural language processing tools: when spelling is not determined by a consensual norm, a given word can appear in many alternative written forms, which yields a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings, from which variation rules are automatically extracted. The rules are then used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring manual rule definition by experts. We apply this multilingual methodology to Alsatian, a French regional language, and provide (i) an intrinsic evaluation of the correctness of the obtained variant pairs and (ii) an extrinsic evaluation on a downstream task, part-of-speech tagging. We show that, in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs and (ii) a reduction in out-of-vocabulary words that improves tagging performance by 1 to 4%.
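
The rule-extraction and matching steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how rules are represented, so the sketch assumes context-free character-level substitutions obtained by aligning each variant pair with difflib.SequenceMatcher. The function names (extract_rules, candidates, match_oov) and the toy strings are hypothetical and are not taken from the paper's data.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect character-level substitution rules from spelling-variant pairs.

    Assumption: a rule is a plain (source, target) substring pair with no
    context; the paper may represent rules differently.
    """
    rules = set()
    for left, right in pairs:
        matcher = SequenceMatcher(None, left, right, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                a, b = left[i1:i2], right[j1:j2]
                # Variation is symmetric, so keep both directions.
                rules.add((a, b))
                rules.add((b, a))
    return rules


def candidates(word, rules):
    """Spellings reachable from `word` by applying one rule at one position."""
    out = set()
    for src, dst in rules:
        if not src:
            continue  # skip pure insertions: unconstrained without context
        start = word.find(src)
        while start != -1:
            out.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    out.discard(word)
    return out


def match_oov(oov, lexicon, rules):
    """Return known lexicon entries that are rule-derived variants of `oov`."""
    return sorted(candidates(oov, rules) & lexicon)


# Toy data, invented for illustration only.
crowdsourced_pairs = [("hus", "hüs")]
lexicon = {"mus", "hüs"}
rules = extract_rules(crowdsourced_pairs)
print(match_oov("müs", lexicon, rules))  # ['mus']
```

A real system would likely need contextual constraints or rule scoring to limit false matches; this sketch applies a single rule at a single position and keeps only candidates already present in the lexicon.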

https://hal.archives-ouvertes.fr/hal-02280002
Contributor : Alice Millour
Submitted on : Thursday, September 5, 2019 - 5:35:49 PM
Last modification on : Sunday, September 8, 2019 - 1:18:48 AM
Long-term archiving on: Thursday, February 6, 2020 - 1:59:17 AM

File

Proceedings_of_Recent_Advances...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02280002, version 1

Citation

Alice Millour, Karën Fort. Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling. RANLP, Sep 2019, Varna, Bulgaria. pp. 776-784. ⟨hal-02280002⟩
