Conference papers

The French Social Media Bank: a Treebank of Noisy User Generated Content

Abstract : In recent years, statistical parsers have reached high performance levels on well-edited texts. Domain adaptation techniques have improved parsing results on text genres that differ from the journalistic data most parsers are trained on. However, such corpora usually comply with standard linguistic, spelling and typographic conventions. Meanwhile, the rise of Web 2.0 communication media has produced new types of online textual data. Although valuable, e.g., for data mining and sentiment analysis, such user-generated content rarely complies with standard conventions: it is noisy. This prevents most NLP tools, especially treebank-based parsers, from performing well on such data. For this reason, we have developed the French Social Media Bank, the first user-generated content treebank for French, a morphologically rich language (MRL). The first release of this resource contains 1,700 sentences from various Web 2.0 sources, including data specifically chosen for its high noisiness. We describe here how we created this treebank and present the methodology we used to fully annotate it. We also provide baseline POS tagging and statistical constituency parsing results, which are far lower than usual results on edited texts. This highlights how difficult it is to automatically process such noisy data in an MRL.
Document type : Conference papers

Cited literature [34 references]

Contributor : Djamé Seddah
Submitted on : Friday, January 25, 2013 - 12:56:51 AM
Last modification on : Friday, January 21, 2022 - 3:22:25 AM
Long-term archiving on : Friday, April 26, 2013 - 3:54:56 AM


Files produced by the author(s)


  • HAL Id : hal-00780895, version 1


Djamé Seddah, Benoît Sagot, Marie Candito, Virginie Mouilleron, Vanessa Combet. The French Social Media Bank: a Treebank of Noisy User Generated Content. COLING 2012 - 24th International Conference on Computational Linguistics, Kay, Martin and Boitet, Christian, Dec 2012, Mumbai, India. ⟨hal-00780895⟩


