Vers une solution légère de production de données pour le TAL : création d'un tagger de l'alsacien par crowdsourcing bénévole

Alice Millour; Karën Fort; Delphine Bernhard; Lucie Steiblé

Conference paper, Year: 2017

Toward a lightweight solution to the language resources bottleneck issue: creating a POS tagger for Alsatian using voluntary crowdsourcing

Alice Millour

Role: Author
PersonId: 21553
IdHAL: alice-millour
IdRef: 253127947

Sens, Texte, Informatique, Histoire

Karën Fort

Role: Author
PersonId: 2215
IdHAL: karen-fort
ORCID: 0000-0002-0723-8850
IdRef: 176299548

Sens, Texte, Informatique, Histoire

Delphine Bernhard

Role: Author
PersonId: 119
IdHAL: delphine-bernhard
ORCID: 0000-0001-7857-5873
IdRef: 112578063

Linguistique, Langues et Parole

Lucie Steiblé

Role: Author
PersonId: 7982
IdHAL: lucie-steible
IdRef: 186348738

Linguistique, Langues et Parole

Abstract

We present the results of an experiment on part-of-speech annotation of a corpus in a low-resourced regional language, Alsatian, using Bisame, a voluntary crowdsourcing platform developed specifically for this purpose. The platform has been online since May 2016 and has allowed us to gather 15,846 annotations from 42 participants. An evaluation performed on a reference corpus shows that the volunteers' annotations reach an F-measure of 0.93. The tagger trained on the annotated corpus reaches 82% accuracy. It is the first POS tagger developed specifically for Alsatian. This method of developing language resources therefore proves efficient and promising for some low-resourced languages, provided that a sufficient number of their speakers are connected and active on the Web. The platform code, the annotated corpus and the tagger are all freely available.

Keywords

crowdsourcing; POS-Tagging; Alsatian; low-resourced languages

Domains

Document and Text Processing

Main file

taln2017_alsacien.pdf (1.06 MB)

Origin: Files produced by the author(s)

Contributor: Karën Fort

https://hal.science/hal-01516226

Submitted on: Thursday, October 5, 2017, 11:42:51

Last modified on: Monday, March 13, 2023, 11:12:17

Dates and versions

hal-01516226, version 1 (29-04-2017)

hal-01516226, version 2 (05-10-2017)

Identifiers

HAL Id: hal-01516226, version 2

Cite

Alice Millour, Karën Fort, Delphine Bernhard, Lucie Steiblé. Vers une solution légère de production de données pour le TAL : création d'un tagger de l'alsacien par crowdsourcing bénévole. Traitement Automatique des Langues Naturelles (TALN), Jun 2017, Orléans, France. ⟨hal-01516226v2⟩

Collections

CAMPUS-AAR AAI SITE-ALSACE SORBONNE-UNIVERSITE STIH SU-LETTRES

256 views

238 downloads
