Conference papers

Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling

Abstract : Non-standardized languages are a challenge to the construction of representative linguistic resources and to the development of efficient natural language processing tools: when spelling is not determined by a consensual norm, a multiplicity of alternative written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings, from which variation rules are automatically extracted. The rules are then used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring manual rule definition by experts. We apply this multilingual methodology to Alsatian, a French regional language, and provide (i) an intrinsic evaluation of the correctness of the obtained variant pairs and (ii) an extrinsic evaluation on a downstream task: part-of-speech tagging. We show that, in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs and (ii) a reduction in out-of-vocabulary words that improves tagging performance by 1 to 4%.
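The rule-extraction and matching loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify the rule format, so the sketch assumes rules are plain character-level substitutions obtained by aligning the two members of a variant pair with Python's `difflib`. The function names `extract_rules` and `match_oov` are hypothetical, and the strings are invented toy forms, not Alsatian data.

```python
from difflib import SequenceMatcher


def extract_rules(variant_pairs):
    """Collect character-level substitution rules (old, new) from variant pairs.

    Assumption: a "variation rule" is a substring replacement observed when
    aligning the two spellings; the paper may define rules differently.
    """
    rules = set()
    for a, b in variant_pairs:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                rules.add((a[i1:i2], b[j1:j2]))
                rules.add((b[j1:j2], a[i1:i2]))  # treat variation as symmetric
    return rules


def match_oov(word, lexicon, rules):
    """Return lexicon entries reachable from `word` by applying one rule once."""
    matches = set()
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            candidate = word[:start] + new + word[start + len(old):]
            if candidate in lexicon:
                matches.add(candidate)
            start = word.find(old, start + 1)
    return matches


# Toy usage with invented strings (hypothetical data).
rules = extract_rules([("bata", "bota")])
variants = match_oov("kala", {"kola"}, rules)
```

Applying each rule at a single position keeps the candidate set small; a fuller implementation would likely need rule contexts or frequency filtering to limit false matches, which this sketch leaves out.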

Cited literature [25 references]
Contributor : Alice Millour
Submitted on : Thursday, September 5, 2019 - 5:35:49 PM
Last modification on : Saturday, December 4, 2021 - 4:03:41 AM
Long-term archiving on : Thursday, February 6, 2020 - 1:59:17 AM


Files produced by the author(s)


  • HAL Id : hal-02280002, version 1


Alice Millour, Karën Fort. Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling. RANLP, Sep 2019, Varna, Bulgaria. pp. 776-784. ⟨hal-02280002⟩


