
Analyse automatique de documents anciens : tirer parti d’un corpus incomplet, hétérogène et bruité

Abstract : In this article, we address problems raised by noisy and heterogeneous data in the digital humanities. We study the so-called mazarinades corpus, which gathers around 5,500 French documents from the 17th century. First, we show that this set of documents is not, strictly speaking, a corpus, since its coverage has not been thoroughly defined. Then, we argue that interesting results can be obtained even from such an incomplete, heterogeneous and noisy dataset by strictly limiting the amount of preprocessing applied to the texts. Finally, we present results from a case study on document dating, in which we aim to fill in missing metadata in the mazarinades corpus. We use a method based on character-string analysis that is robust to noisy data and can even take advantage of this noise to improve the quality of the results.
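
The abstract names the approach (character-string analysis with minimal preprocessing) but does not describe the algorithm, so the following is only an illustrative sketch under our own assumptions, not the authors' method. It assumes character trigrams as features, one pooled profile per year, and cosine similarity as the comparison; the function names `char_ngrams`, `cosine` and `guess_year` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams on the raw text (no tokenization,
    no spelling normalization), so noise stays in the features."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_year(undated_text, dated_texts, n=3):
    """Return the year whose pooled n-gram profile is closest to the text.

    dated_texts maps a year to a list of raw texts already dated to that
    year (assumed input format, chosen for this sketch only).
    """
    target = char_ngrams(undated_text, n)
    profiles = {}
    for year, texts in dated_texts.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text, n))
        profiles[year] = profile
    return max(profiles, key=lambda year: cosine(target, profiles[year]))
```

Working directly on character strings means that spelling variants and recognition errors are kept as features rather than cleaned away, which is one plausible reading of how noise could help rather than hurt.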

https://hal.archives-ouvertes.fr/hal-02467535
Contributor : Gaël Lejeune
Submitted on : Wednesday, February 5, 2020 - 9:30:34 AM
Last modification on : Monday, March 2, 2020 - 6:24:48 PM

Citation

Karine Abiven, Gaël Lejeune. Analyse automatique de documents anciens : tirer parti d’un corpus incomplet, hétérogène et bruité. Recherche d’Information, Document et Web Sémantique, ISTE OpenScience, 2019, 2 (1), ⟨10.21494/ISTE.OP.2019.0335⟩. ⟨hal-02467535⟩
