Improving generative statistical parsing with semi-supervised word clustering

Marie Candito; Benoît Crabbé

Conference paper, Year: 2009

Marie Candito

Role: Author
PersonId : 13596
IdHAL : marie-candito
IdRef : 153698616

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Benoît Crabbé

Role: Author
PersonId : 6726
IdHAL : benoit-crabbe
ORCID : 0000-0002-0821-0913
IdRef : 168451107

Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing

Abstract

We present a semi-supervised method to improve statistical parsing performance. We focus on the well-known problem of lexical data sparseness and present experiments on word clustering prior to parsing. We use a combination of lexicon-aided morphological clustering, which preserves tagging ambiguity, and unsupervised word clustering trained on a large unannotated corpus. We apply these clusterings to the French Treebank and train a parser with the unlexicalized PCFG-LA algorithm of Petrov et al. (2006). We find a gain in French parsing performance: from a baseline of F1=86.76% to F1=87.37% using morphological clustering, and up to F1=88.29% when unsupervised clustering is added. This is the best known score for French probabilistic parsing. These preliminary results are encouraging for statistical parsing of morphologically rich languages and of languages with a small amount of annotated data.
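The idea of clustering words prior to parsing can be illustrated with a minimal sketch: each token is replaced by a cluster identifier before the text is handed to the parser, so that rare words share statistics with more frequent ones. This is only an illustration under stated assumptions and not the authors' implementation: the cluster table `TOY_CLUSTERS`, the identifiers such as `C17`, the fallback `UNKNOWN_CLUSTER`, and the lowercased lookup are all hypothetical.

```python
from typing import Dict, List

# Hypothetical word -> cluster-id table. In a real setting such a table would
# be learned from a large unannotated corpus; these entries are invented.
TOY_CLUSTERS: Dict[str, str] = {
    "mange": "C17",
    "dévore": "C17",
    "pomme": "C42",
    "poire": "C42",
}

# Assumed fallback for words absent from the table (not taken from the paper).
UNKNOWN_CLUSTER = "C_UNK"


def to_cluster_ids(tokens: List[str], clusters: Dict[str, str]) -> List[str]:
    """Replace each token by its cluster id, using a lowercased lookup."""
    return [clusters.get(tok.lower(), UNKNOWN_CLUSTER) for tok in tokens]


sentence = ["Marie", "mange", "une", "pomme"]
clustered = to_cluster_ids(sentence, TOY_CLUSTERS)
print(clustered)
```

In this sketch, "mange" and "dévore" map to the same symbol, so a parser trained on the clustered text would see them as one terminal. How unknown words and capitalization are actually handled is not stated in the abstract.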

Keywords

statistical parsing unsupervised word clustering

Domains

Computation and Language [cs.CL]; Artificial Intelligence [cs.AI]; Document and Text Processing

Main file

IWPT09-clustering.pdf (41.96 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-00495267

Submitted on: Tuesday, September 7, 2010, 15:46:14

Last modified on: Wednesday, October 26, 2022, 17:37:14

Long-term archiving on: Wednesday, December 8, 2010, 02:26:38

Dates and versions

hal-00495267 , version 1 (07-09-2010)

Identifiers

HAL Id : hal-00495267 , version 1

Cite

Marie Candito, Benoît Crabbé. Improving generative statistical parsing with semi-supervised word clustering. 11th International Conference on Parsing Technologies - IWPT'09, Oct 2009, Paris, France. pp.169-172. ⟨hal-00495267⟩

Collections

UNIV-PARIS7 INRIA INRIA2 CAMPUS-AAR AAI ANR
