
Myriadisation de ressources linguistiques pour le traitement automatique de langues non standardisées

Abstract : Citizen science, and voluntary crowdsourcing in particular, is a little-explored way of producing language resources for languages that remain under-resourced even though enough of their speakers are active online. This work presents the experiments we conducted to enable the crowdsourcing of linguistic resources for the development of automatic part-of-speech annotation tools. We applied the methodology to three non-standardised languages: Alsatian, Guadeloupean Creole and Mauritian Creole. For various historical reasons, multiple (ortho)graphic practices coexist in each of these languages. The difficulties raised by this variation led us to propose several crowdsourcing tasks for collecting raw corpora, part-of-speech annotations, and graphic variants. Intrinsic and extrinsic analyses of these resources, which were used to develop automatic annotation tools, show the value of crowdsourcing in a non-standardised linguistic setting: participants are seen here not as a uniform set of contributors whose cumulative efforts complete a particular task, but as holders of complementary knowledge. The resources they collectively produce make it possible to develop tools that embrace variation. The platforms developed, the language resources, and the trained tagger models are freely available.
Document type :
Theses

https://hal.archives-ouvertes.fr/tel-03083213
Contributor : Alice Millour
Submitted on : Wednesday, January 6, 2021 - 12:28:38 PM
Last modification on : Saturday, December 4, 2021 - 4:09:03 AM

File

These_Millour_2020.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-03083213, version 2

Citation

Alice Millour. Myriadisation de ressources linguistiques pour le traitement automatique de langues non standardisées. Informatique et langage [cs.CL]. Sorbonne Université, 2020. French. ⟨tel-03083213v2⟩

Metrics

Record views : 77
File downloads : 90