Conference papers

A Data-driven Approach to Named Entity Recognition for Early Modern French

Abstract : Named entity recognition has become an increasingly useful tool for digital humanities research, especially when it comes to historical texts. However, historical texts pose a wide range of challenges to both named entity recognition and natural language processing in general that are still difficult to address even with modern neural methods. In this article we focus on named entity recognition for historical French, and in particular for Early Modern French (16th-18th c.), i.e. Ancien Régime French. Instead of developing a specialised architecture to tackle the particularities of this state of language, we opt for a data-driven approach by developing a new corpus with fine-grained entity annotation, covering three centuries of literature corresponding to the early modern period; we try to annotate as much data as possible, producing a corpus that is many times bigger than the most popular NER evaluation corpora for both Contemporary English and French. We then fine-tune existing state-of-the-art architectures for Early Modern and Contemporary French, obtaining results that are on par with those of the current state-of-the-art NER systems for Contemporary English. Both the corpus and the fine-tuned models are released.
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-03814449
Contributor : Pedro Ortiz Suarez
Submitted on : Friday, October 14, 2022 - 2:42:52 AM
Last modification on : Friday, October 21, 2022 - 10:47:16 PM

File

COLING_2022_NER.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id : hal-03814449, version 1

Citation

Pedro Ortiz Suarez, Simon Gabay. A Data-driven Approach to Named Entity Recognition for Early Modern French. Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Oct 2022, Gyeongju, South Korea. ⟨hal-03814449⟩
