Conference paper, Year: 2012

The French Social Media Bank: a Treebank of Noisy User Generated Content

Abstract

In recent years, statistical parsers have reached high performance levels on well-edited texts. Domain adaptation techniques have improved parsing results on text genres that differ from the journalistic data most parsers are trained on. However, such corpora usually comply with standard linguistic, spelling and typographic conventions. Meanwhile, the emergence of Web 2.0 communication media has given rise to new types of online textual data. Although valuable, e.g., for data mining and sentiment analysis, such user-generated content rarely complies with standard conventions: it is noisy. This prevents most NLP tools, especially treebank-based parsers, from performing well on such data. For this reason, we have developed the French Social Media Bank, the first user-generated content treebank for French, a morphologically rich language (MRL). The first release of this resource contains 1,700 sentences from various Web 2.0 sources, including data specifically chosen for their high noisiness. We describe here how we created this treebank and present the methodology we used to fully annotate it. We also provide baseline POS tagging and statistical constituency parsing results, which are far lower than usual results on edited texts. This highlights how difficult it is to automatically process such noisy data in an MRL.
Main file
coling2012.pdf (154.32 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00780895, version 1 (25-01-2013)

Identifiers

  • HAL Id: hal-00780895, version 1

Cite

Djamé Seddah, Benoît Sagot, Marie Candito, Virginie Mouilleron, Vanessa Combet. The French Social Media Bank: a Treebank of Noisy User Generated Content. COLING 2012 - 24th International Conference on Computational Linguistics, Kay, Martin and Boitet, Christian, Dec 2012, Mumbai, India. ⟨hal-00780895⟩
