POS-tagging for Oral Texts with CRF and Category Decomposition

Abstract : The ESLO (Enquête sociolinguistique d'Orléans, i.e. Sociolinguistic Survey of Orléans) campaign gathered a large oral corpus, which was later transcribed into text. The purpose of this work is to assign a morpho-syntactic label to each unit of this corpus. To this end, we first studied the specific labels required for oral data and the levels of description at which they can be expressed. This led to a new hierarchical structure of labels. Since this label set differs from those of existing taggers, which are usually ill-suited to oral data, we built a new labelling tool using a machine learning approach. As a starting point, we used data labelled by Cordial and corrected by hand. We used CRFs (Conditional Random Fields) in order to take the best possible advantage of the linguistic knowledge used to define the label set. We measure an accuracy between 85% and 90%, depending on the parameters.
Document type :
Journal articles
Research in Computing Science, Instituto Politécnico Nacional, 2010, 46, pp.79--90. 〈http://www.cicling.org/2010/〉

Cited literature [11 references]

https://hal.archives-ouvertes.fr/hal-00467951
Contributor : Jean-Philippe Prost
Submitted on : Monday, March 29, 2010 - 4:12:43 PM
Last modification on : Monday, April 30, 2018 - 10:58:02 AM
Document(s) archived on : Friday, October 19, 2012 - 10:49:11 AM

File

tellierEtAl2010-cicling.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00467951, version 1

Citation

Isabelle Tellier, Iris Eshkol, Samer Taalab, Jean-Philippe Prost. POS-tagging for Oral Texts with CRF and Category Decomposition. Research in Computing Science, Instituto Politécnico Nacional, 2010, 46, pp.79--90. 〈http://www.cicling.org/2010/〉. 〈hal-00467951〉
