Conference paper. Year: 2010

Tools and Methodologies for Annotating Syntax and Named Entities in the National Corpus of Polish

Abstract

The ongoing project to create the National Corpus of Polish involves several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: syntactic words, syntactic groups, and named entities. We show how the knowledge-based platforms Spejd and SProUT are used for automatic pre-annotation of the corpus, and we discuss particular problems encountered while developing the syntactic grammar, which contains over 800 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customized for manual post-editing of annotations and for subsequent revision of discrepancies. Our XML format converters and customized archiving repository ensure automatic data flow and efficient corpus file management. We believe that this environment, or substantial parts of it, can be reused in or adapted for other corpus annotation tasks.
No file deposited

Dates and versions

hal-01024326, version 1 (16-07-2014)

Identifiers

  • HAL Id: hal-01024326, version 1

Cite

Jakub Waszczuk, Katarzyna Glowinska, Agata Savary, Adam Przepiorkowski. Tools and Methodologies for Annotating Syntax and Named Entities in the National Corpus of Polish. Computational Linguistics - Applications, 2010, Wisla, Poland. ⟨hal-01024326⟩