Conference paper, Year: 2009

Improving generative statistical parsing with semi-supervised word clustering

Abstract

We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments with word clustering prior to parsing. We use a combination of lexicon-aided morphological clustering that preserves tagging ambiguity, and unsupervised word clustering trained on a large unannotated corpus. We apply these clusterings to the French Treebank, and we train a parser with the PCFG-LA unlexicalized algorithm of Petrov et al. (2006). We find a gain in French parsing performance: from a baseline of F1=86.76% to F1=87.37% using morphological clustering, and up to F1=88.29% using further unsupervised clustering. This is the best known score for French probabilistic parsing. These preliminary results are encouraging for the statistical parsing of morphologically rich languages and of languages with a small amount of annotated data.
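As a rough illustration of what "word clustering prior to parsing" can mean in practice, the sketch below replaces each terminal of a bracketed tree with a cluster identifier before the tree would be handed to a parser trainer. It is only a minimal sketch under stated assumptions: the table `WORD_TO_CLUSTER`, the cluster labels, the helper `replace_words_with_clusters`, and the toy tree are all hypothetical and are not taken from the paper, whose actual clustering procedure is not described in this abstract.

```python
import re

# Hypothetical word-to-cluster table (illustration only); a real table would
# be produced by a clustering step run on a large unannotated corpus.
WORD_TO_CLUSTER = {
    "mange": "C0110",
    "mangeait": "C0110",
    "pomme": "C1011",
    "poire": "C1011",
}

# Matches a preterminal such as "(V mange)": a tag followed by a single token.
PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def replace_words_with_clusters(tree, table):
    """Swap each terminal in a bracketed tree for its cluster identifier.

    Lookup is done on the lowercased word; words missing from the table
    are left unchanged.
    """
    def substitute(match):
        tag, word = match.group(1), match.group(2)
        return "({} {})".format(tag, table.get(word.lower(), word))

    return PRETERMINAL.sub(substitute, tree)


# Toy tree, invented for this example.
tree = "(SENT (NP (D la) (N pomme)) (VN (V mange)))"
clustered = replace_words_with_clusters(tree, WORD_TO_CLUSTER)
```

Because several word forms share one identifier, the parser sees fewer distinct terminals, which is the general idea behind reducing lexical sparseness. The same mapping would have to be applied to input sentences at parsing time.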
Main file
IWPT09-clustering.pdf (41.96 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00495267, version 1 (07-09-2010)

Identifiers

  • HAL Id: hal-00495267, version 1

Cite

Marie Candito, Benoît Crabbé. Improving generative statistical parsing with semi-supervised word clustering. 11th International Conference on Parsing Technologies - IWPT'09, Oct 2009, Paris, France. pp.169-172. ⟨hal-00495267⟩