Mise au point d'une méthode d'annotation morphosyntaxique fine du serbe

Aleksandra Miletic; Cécile Fabre; Dejan Stosic

Conference paper, Year: 2016

Developping a method for detailed morphosyntactic tagging of Serbian

Aleksandra Miletic

Role: Author
PersonId : 1028050

Cognition, Langues, Langage, Ergonomie

Cécile Fabre

Role: Author
PersonId : 6972
IdHAL : cecilefabre
ORCID : 0000-0002-6954-9224
IdRef : 052157776

Cognition, Langues, Langage, Ergonomie

Dejan Stosic

Role: Author
PersonId : 11321
IdHAL : dejan-stosic
ORCID : 0000-0003-3853-983X
IdRef : 070166846

Cognition, Langues, Langage, Ergonomie

Abstract

This paper presents an experiment in detailed morphosyntactic tagging of the Serbian subcorpus of the parallel Serbian-French-English ParCoLab corpus. We enriched an existing POS annotation with finer-grained morphosyntactic properties in order to prepare the corpus for subsequent parsing stages. We compared three approaches: 1) manual annotation; 2) pre-annotation with a tagger trained on Croatian, followed by manual correction; 3) retraining the model on a small validated sample of the corpus (20K tokens), followed by automatic annotation and manual correction. The Croatian model remains globally stable when applied to Serbian texts, but because the two tagsets differ, substantial manual intervention was still required. A new model was trained on a validated sample of the corpus: it has the same accuracy as the existing model, but the observed acceleration of the manual correction confirms that it is better suited to the task than the first one.

This article presents an experiment in fine-grained morphosyntactic annotation of the Serbian part of the parallel ParCoLab corpus (Serbian-French-English). The work consisted of enriching an existing part-of-speech annotation with fine-grained morphosyntactic features in preparation for a later parsing stage. We compared three approaches: 1) manual annotation; 2) pre-annotation with a tagger trained on Croatian, followed by manual correction; 3) retraining the tool on a small validated sample of the corpus, followed by automatic annotation and manual correction. The Croatian model remains globally stable when applied to Serbian, but the differences between the two tagsets call for substantial manual intervention. The model retrained on a limited-size sample (20K tokens) reaches the same accuracy as the existing model, and the observed time savings show that this method optimizes the correction phase.

Keywords

Morphosyntactic tagging; training corpus; Serbian

Domains

Linguistics

Main file

Miletic_et_al_TALN2016.pdf (2.14 MB)

Origin: Publisher files allowed on an open archive

https://hal.science/hal-01377060

Submitted on: Thursday, 6 October 2016, 18:30:47

Last modified on: Friday, 26 May 2023, 11:18:08

Long-term archiving on: Saturday, 7 January 2017, 12:40:32

Dates and versions

hal-01377060 , version 1 (06-10-2016)

Identifiers

HAL Id : hal-01377060 , version 1

Cite

Aleksandra Miletic, Cécile Fabre, Dejan Stosic. Mise au point d'une méthode d'annotation morphosyntaxique fine du serbe. Conférence conjointe JEP-TALN-RECITAL 2016, ATALA, Jul 2016, Paris, France. pp.506-513. ⟨hal-01377060⟩

Collections

EPHE UNIV-TLSE2 CNRS CLLE PSL UNIV-BORDEAUX-MONTAIGNE

349 views

84 downloads
