Annotation tools for syntax and named entities in the National Corpus of Polish.

Jakub Waszczuk; Katarzyna Glowinska; Agata Savary; Adam Przepiórkowski; Michel Lenart

Journal article. International Journal of Data Mining, Modelling and Management. Year: 2013

Jakub Waszczuk

Role: Author

Instytut Podstaw Informatyki

Katarzyna Glowinska

Role: Author

Instytut Podstaw Informatyki

Agata Savary

Role: Author
PersonId : 4644
IdHAL : agata-savary
IdRef : 113077661

Laboratoire d'Informatique Fondamentale et Appliquée de Tours

Adam Przepiórkowski

Role: Author

Instytut Podstaw Informatyki

Michel Lenart

Role: Author

Institute of Informatics

Abstract

The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how the knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.

Keywords

corpus annotation; National Corpus of Polish; shallow parsing; chunking grammars; named entity recognition; NER; syntax; named entities; linguistic annotation; syntactic words; syntactic groups; parser grammar; XML converters; customised archiving repository; automatic data flow; file management

Domains

Computation and Language [cs.CL]; Databases [cs.DB]

Contributor: Denis Maurel

https://hal.science/hal-01021348

Submitted on: Wednesday, 9 July 2014, 11:39:00

Last modified on: Friday, 16 February 2024, 18:16:04

Dates and versions

hal-01021348, version 1 (09-07-2014)

Identifiers

HAL Id: hal-01021348, version 1

Cite

Jakub Waszczuk, Katarzyna Glowinska, Agata Savary, Adam Przepiórkowski, Michel Lenart. Annotation tools for syntax and named entities in the National Corpus of Polish. International Journal of Data Mining, Modelling and Management, 2013, 5 (2), pp. 103-122. ⟨hal-01021348⟩

Collections

UNIV-TOURS CNRS LIFAT INSA-GROUPE INSA-CVL
