Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French

Abstract : Language registers are a strongly perceptible characteristic of texts and speech. However, they remain poorly studied in natural language processing. In this paper, we present a semi-supervised approach that jointly builds a corpus of texts labeled by register and an associated classifier. The approach relies on a small initial seed of expert data. After retrieving web pages on a massive scale, it iteratively alternates between training an intermediate classifier and annotating new texts to augment the labeled corpus. The approach is applied to the casual, neutral, and formal registers, yielding a 750M-word corpus and a final neural classifier with acceptable performance.
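
The iterative loop described in the abstract (train an intermediate classifier, annotate new texts, augment the labeled corpus) can be illustrated with a minimal self-training sketch. This is not the authors' implementation: the paper reports a neural classifier, whereas the sketch below swaps in a tiny naive Bayes model so that it stays self-contained. The function names, the confidence threshold, and the stopping rule are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
import math

REGISTERS = ("casual", "neutral", "formal")


def train(labeled):
    """Fit a tiny multinomial naive Bayes model on (text, register) pairs.

    Stand-in for the intermediate classifier; the paper uses a neural model.
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, register in labeled:
        doc_counts[register] += 1
        word_counts[register].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return (most likely register, posterior probability of that register)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for register in doc_counts:
        total_words = sum(word_counts[register].values())
        score = math.log(doc_counts[register] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            count = word_counts[register][word]
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        log_scores[register] = score
    # Normalize the log scores into probabilities in a numerically stable way.
    top = max(log_scores.values())
    weights = {r: math.exp(s - top) for r, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_train(seed, unlabeled, threshold=0.9, max_rounds=5):
    """Alternate training and automatic annotation (hypothetical settings).

    seed: list of (text, register) pairs labeled by experts.
    unlabeled: list of raw texts, e.g. retrieved web pages.
    threshold: assumed minimum confidence for accepting an automatic label.
    """
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            register, confidence = predict(model, text)
            if confidence >= threshold:
                confident.append((text, register))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was annotated, so further rounds change nothing
        labeled.extend(confident)
        pool = remaining
    return train(labeled), labeled
```

A call such as `self_train(seed, web_texts)` returns the final model together with the augmented labeled corpus. Texts that never reach the threshold stay unlabeled, which trades corpus size for label reliability; where that threshold should sit is a design choice the abstract does not specify.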

https://hal.archives-ouvertes.fr/hal-02064694
Contributor : Gwénolé Lecorvé
Submitted on : Tuesday, April 9, 2019 - 11:43:49 AM
Last modification on : Monday, July 15, 2019 - 12:31:14 PM

Identifiers

  • HAL Id : hal-02064694, version 1

Citation

Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, et al. Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French. International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), Apr 2019, La Rochelle, France. ⟨hal-02064694⟩
