Searching for Discriminative Metadata of Heterogenous Corpora

Abstract : In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus is a treebank of Old French made up of texts that differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments to determine which of these metadata are the most discriminative and to induce a general methodology.

https://hal.archives-ouvertes.fr/hal-01250981
Contributor : Sophie Prevost
Submitted on : Tuesday, January 5, 2016 - 2:47:49 PM
Last modification on : Wednesday, May 22, 2019 - 3:46:02 PM
Long-term archiving on : Thursday, April 7, 2016 - 3:25:36 PM

File

guibon_al_TLT2015.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01250981, version 1

Citation

Gaël Guibon, Isabelle Tellier, Sophie Prévost, Mathieu Constant, Kim Gerdes. Searching for Discriminative Metadata of Heterogenous Corpora. Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT14), Dec 2015, Warsaw, Poland. pp.72-82. ⟨hal-01250981⟩
