POS-tagging for Oral Texts with CRF and Category Decomposition

Abstract : The ESLO (Enquête sociolinguistique d'Orléans, i.e. Sociolinguistic Survey of Orléans) campaign gathered a large oral corpus, which was later transcribed into text. The purpose of this work is to assign a morpho-syntactic label to each unit of this corpus. To this end, we first studied the specific requirements of labels for oral data and the various levels of description they can take. This led to a new hierarchical structure of labels. Since this label set differs from those of existing taggers, which are usually ill-suited to oral data, we built a new labelling tool using a Machine Learning approach. As a starting point, we used data labelled by Cordial and corrected by hand. We used CRFs (Conditional Random Fields) to take the best possible advantage of the linguistic knowledge used to define the label set. We measure an accuracy between 85% and 90%, depending on the parameters.
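As a rough illustration of what decomposing a hierarchical label can look like, the sketch below splits a composite tag into per-level components and scores predictions at a chosen level of description. It is a minimal sketch under stated assumptions: the ":" separator, the example tags, and the helper names `decompose`, `recompose` and `accuracy` are all hypothetical and are not taken from the paper's label set or tool.

```python
from typing import List, Optional, Tuple

# Hypothetical convention: a composite label is a ":"-separated path,
# from the coarsest category to the finest feature (not the paper's tag set).

def decompose(label: str, depth: int = 3) -> Tuple[str, ...]:
    """Split a composite label into `depth` components, padding with "_"."""
    parts = label.split(":")[:depth]
    return tuple(parts + ["_"] * (depth - len(parts)))

def recompose(components: Tuple[str, ...]) -> str:
    """Rebuild a composite label, dropping the padding components."""
    return ":".join(c for c in components if c != "_")

def accuracy(gold: List[str], predicted: List[str],
             level: Optional[int] = None) -> float:
    """Share of units labelled correctly.

    With `level=None` the full labels are compared; otherwise only the
    first `level` components are compared (coarser description).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    if level is None:
        hits = sum(g == p for g, p in zip(gold, predicted))
    else:
        hits = sum(decompose(g)[:level] == decompose(p)[:level]
                   for g, p in zip(gold, predicted))
    return hits / len(gold)

# Illustrative data only: invented tags for three units.
gold = ["DET:def", "NOUN:common:sing", "VERB:pres:3sg"]
pred = ["DET:def", "NOUN:common:plur", "ADV"]

full_score = accuracy(gold, pred)             # exact match on full labels
coarse_score = accuracy(gold, pred, level=1)  # top-level category only
```

Scoring at several levels in this way shows how an error on a fine-grained feature can still count as correct at a coarser level of the hierarchy.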


Contributor : Jean-Philippe Prost
Submitted on : Monday, March 29, 2010 - 4:12:43 PM
Last modification on : Thursday, February 7, 2019 - 3:48:58 PM
Document(s) archived on : Friday, October 19, 2012 - 10:49:11 AM


Files produced by the author(s)


  • HAL Id : hal-00467951, version 1



Isabelle Tellier, Iris Eshkol, Samer Taalab, Jean-Philippe Prost. POS-tagging for Oral Texts with CRF and Category Decomposition. Research in Computing Science, Instituto Politécnico Nacional, 2010, 46, pp. 79-90. 〈http://www.cicling.org/2010/〉. 〈hal-00467951〉


