Parsing word clusters

Marie Candito 1, Djamé Seddah 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : We present and discuss experiments in statistical parsing of French, where terminal forms used during training and parsing are replaced by more general symbols, particularly clusters of words obtained through unsupervised linear clustering. We build on the work of Candito and Crabbé (2009), who proposed using clusters built over slightly coarsened French inflected forms. We investigate the alternative method of building clusters over lemma/part-of-speech pairs, using a raw corpus that is automatically tagged and lemmatized. We find that both methods lead to comparable improvements over the baseline (we obtain F_1=86.20% and F_1=86.21% respectively, compared to a baseline of F_1=84.10%). Yet, when we replace gold lemma/POS pairs with their corresponding clusters, we obtain an upper bound (F_1=87.80%) that suggests room for improvement for this technique, should tagging/lemmatization performance increase for French. We also analyze the improvement in performance for both techniques with respect to word frequency. We find that replacing word forms with clusters improves attachment performance for words that are originally either unknown or low-frequency, since these words are replaced by cluster symbols that tend to have higher frequencies. Furthermore, clustering also helps significantly for medium- to high-frequency words, suggesting that training on word clusters leads to better probability estimates for these words.
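
The substitution described in the abstract can be illustrated with a minimal sketch. It assumes a word-to-cluster mapping is already available; the mapping, the cluster symbols (`C1`, `C2`, `C3`), the `C_UNK` fallback for unseen words, and the function name `to_clusters` are all hypothetical and are not taken from the paper. The toy corpus only shows why cluster symbols tend to be more frequent than the word forms they replace.

```python
from collections import Counter

# Hypothetical mapping from lowercased word form to cluster symbol.
# A real mapping would come from unsupervised clustering over a large
# raw corpus; the symbols below are invented for illustration.
WORD_TO_CLUSTER = {
    "chat": "C1", "chien": "C1", "cheval": "C1",
    "mange": "C2", "dort": "C2",
    "le": "C3", "un": "C3",
}


def to_clusters(tokens, mapping=WORD_TO_CLUSTER, unknown="C_UNK"):
    """Replace each terminal form by its cluster symbol.

    Words absent from the mapping get a fallback symbol (an assumption
    of this sketch, not necessarily what the authors do).
    """
    return [mapping.get(tok.lower(), unknown) for tok in tokens]


# Toy "treebank" terminals: three short sentences.
corpus = [
    ["Le", "chat", "dort"],
    ["Un", "chien", "mange"],
    ["Le", "cheval", "dort"],
]

# Frequencies before and after the replacement.
word_counts = Counter(tok.lower() for sent in corpus for tok in sent)
cluster_corpus = [to_clusters(sent) for sent in corpus]
cluster_counts = Counter(sym for sent in cluster_corpus for sym in sent)

if __name__ == "__main__":
    print(cluster_corpus[0])
    print(word_counts["chat"], cluster_counts["C1"])
```

In this toy data each noun occurs once, while the shared symbol `C1` occurs three times, so a parser trained on the cluster symbols would see more evidence per terminal than one trained on the raw forms.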

https://hal.archives-ouvertes.fr/hal-00495177
Contributor : Marie Candito
Submitted on : Tuesday, September 7, 2010 - 3:49:35 PM
Last modification on : Friday, January 4, 2019 - 5:33:24 PM
Long-term archiving on : Wednesday, December 8, 2010 - 2:22:59 AM

File

SPMRL2010-LemmaClust.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00495177, version 1

Citation

Marie Candito, Djamé Seddah. Parsing word clusters. NAACL/HLT-2010 Workshop on Statistical Parsing of Morphologically Rich Languages - SPMRL 2010, Jun 2010, Los Angeles, United States. pp.76-84. ⟨hal-00495177⟩
