Conference papers

Improving generative statistical parsing with semi-supervised word clustering

Marie Candito 1 Benoît Crabbé 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments in which words are clustered prior to parsing. We use a combination of lexicon-aided morphological clustering that preserves tagging ambiguity, and unsupervised word clustering, trained on a large unannotated corpus. We apply these clusterings to the French Treebank, and we train a parser with the PCFG-LA unlexicalized algorithm of Petrov et al. (2006). We find a gain in French parsing performance: from a baseline of F1=86.76% to F1=87.37% using morphological clustering, and up to F1=88.29% using further unsupervised clustering. This is the best known score for French probabilistic parsing. These preliminary results are encouraging for statistically parsing morphologically rich languages, and languages with small amounts of annotated data.
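The general idea of clustering words prior to parsing can be illustrated with a minimal sketch: each word form is replaced by a cluster label before the data reaches the parser, so that rare forms share statistics with more frequent ones. This is not the authors' implementation. The cluster table, the labels, the `replace_with_clusters` helper, and the `C_UNK` fallback are all hypothetical and invented for illustration only.

```python
from typing import Dict, List

# Hypothetical toy cluster table: word form -> cluster label.
# A real table would be produced by a clustering tool run on a large
# unannotated corpus; these entries are invented for illustration.
TOY_CLUSTERS: Dict[str, str] = {
    "le": "C0001",
    "chat": "C0110",
    "chien": "C0110",
    "mange": "C1011",
    "dort": "C1011",
}


def replace_with_clusters(
    tokens: List[str],
    clusters: Dict[str, str],
    unknown: str = "C_UNK",  # assumed fallback label for unseen forms
) -> List[str]:
    """Map each token to its cluster label (lookup is case-insensitive)."""
    return [clusters.get(tok.lower(), unknown) for tok in tokens]


sentence = ["Le", "chat", "dort"]
clustered = replace_with_clusters(sentence, TOY_CLUSTERS)
print(clustered)
```

In this toy table "chat" and "chien" share a label, so a parser trained on the clustered tokens would see them as the same terminal symbol, which is how clustering reduces lexical data sparseness.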

Cited literature [12 references]
Contributor : Marie Candito
Submitted on : Tuesday, September 7, 2010 - 3:46:14 PM
Last modification on : Friday, March 27, 2020 - 3:54:51 AM
Document(s) archived on : Wednesday, December 8, 2010 - 2:26:38 AM


Files produced by the author(s)


  • HAL Id : hal-00495267, version 1



Marie Candito, Benoît Crabbé. Improving generative statistical parsing with semi-supervised word clustering. 11th International Conference on Parsing Technologies - IWPT'09, Oct 2009, Paris, France. pp.169-172. ⟨hal-00495267⟩


