C-structures and f-structures for the British National Corpus

Joachim Wagner; Djamé Seddah; Jennifer Foster; Josef van Genabith

Conference paper, Year: 2007


Joachim Wagner

Role: Author

National Centre for Language Technology

Djamé Seddah

Role: Author
PersonId : 11545
IdHAL : djameseddah
IdRef : 086185136

Langues, logiques, informatiques, cognition

Jennifer Foster

Role: Author

National Centre for Language Technology

Josef van Genabith

Role: Author

National Centre for Language Technology

Abstract

We describe how the British National Corpus (BNC), a 100-million-word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures using a treebank-based parsing architecture. The architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, and an annotation algorithm that automatically annotates these trees to produce LFG f-structures. We describe the pre-processing steps taken to accommodate the differences between the Penn Treebank and the BNC, and discuss some of the issues encountered in applying the parsing architecture at such a large scale. We also describe the process of annotating a gold standard set of 1,000 parse trees. We evaluate the c-structures produced by the statistical parser against this c-structure gold standard, and the f-structures produced by the annotation algorithm against an automatically constructed f-structure gold standard. The c-structures achieve an f-score of 83.7% and the f-structures an f-score of 91.2%.
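The f-scores reported in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of f-score as the harmonic mean of precision and recall over matched items: labelled brackets for c-structures, dependency-style triples for f-structures. The abstract does not name the evaluation tooling, so the function `f_score` and all toy data below are hypothetical illustrations, not the authors' evaluation setup.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def f_score(predicted: Iterable[Hashable],
            gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) over two multisets of items."""
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: an item counts as matched at most as many
    # times as it occurs in both the prediction and the gold standard.
    matched = sum((pred_counts & gold_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical c-structure items: labelled brackets (label, start, end).
gold_brackets = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
parsed_brackets = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
c_precision, c_recall, c_f = f_score(parsed_brackets, gold_brackets)

# Hypothetical f-structure items: (relation, head, dependent) triples.
gold_triples = [("subj", "see", "John"), ("obj", "see", "Mary")]
parsed_triples = [("subj", "see", "John"), ("obj", "see", "Mary")]
f_precision, f_recall, f_f = f_score(parsed_triples, gold_triples)

print(f"c-structure f-score: {c_f:.3f}")
print(f"f-structure f-score: {f_f:.3f}")
```

Treating both evaluations as multiset comparisons keeps one scoring function for both levels of representation; only the items being compared differ.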

Keywords

parsing; probabilistic parsing; out-of-domain parsing; BNC

Domains

Document and Text Processing

Main file

lfg07wagneretal.pdf (112.54 KB)

Origin: Files produced by the author(s)

Contributor: Brigitte Briot

https://inria.hal.science/inria-00545440

Submitted on: Friday, December 10, 2010, 11:19:56

Last modified on: Friday, March 24, 2023, 14:52:53

Long-term archiving on: Friday, March 11, 2011, 03:20:54

Dates and versions

inria-00545440, version 1 (10-12-2010)

Identifiers

HAL Id: inria-00545440, version 1

Cite

Joachim Wagner, Djamé Seddah, Jennifer Foster, Josef van Genabith. C-structures and f-structures for the British National Corpus. Proceedings of the Twelfth International Lexical Functional Grammar Conference, 2007, Stanford, CA, United States. ⟨inria-00545440⟩

Collections

CNRS SORBONNE-UNIVERSITE SU-LETTRES
