
Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Abstract: Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources written in scripta continua, such as ancient epigraphy or medieval manuscripts, (1) such markers are mostly absent, and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification of each character as word-boundary or in-word, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
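The architecture described in the abstract can be sketched as a character-level tagger: a character embedding, a 1-D convolutional encoder, and a linear layer that classifies each character position as word-boundary or in-word. The sketch below is an illustrative PyTorch reconstruction under assumed hyperparameters (embedding size, channel count, kernel width), not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class CharSegmenter(nn.Module):
    """Character-level word segmenter: convolutional encoding of characters
    followed by a per-character linear classifier (boundary vs. in-word).
    All sizes are illustrative assumptions, not the paper's settings."""

    def __init__(self, vocab_size, emb_dim=64, channels=128, kernel=5, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # 'same' padding keeps one prediction per input character
        self.conv = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        self.clf = nn.Linear(channels, n_classes)

    def forward(self, char_ids):              # (batch, seq_len) int ids
        x = self.emb(char_ids)                # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2))      # (batch, channels, seq_len)
        x = torch.relu(x).transpose(1, 2)     # (batch, seq_len, channels)
        return self.clf(x)                    # (batch, seq_len, n_classes)

model = CharSegmenter(vocab_size=100)
logits = model(torch.randint(0, 100, (1, 20)))
print(tuple(logits.shape))  # one 2-way score per character
```

Training such a model only requires unspaced text paired with boundary labels, which can be generated automatically by removing the spaces from any tokenized corpus.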

https://hal.archives-ouvertes.fr/hal-02154122
Contributor: Thibault Clérice
Submitted on: Sunday, April 5, 2020 - 9:24:32 AM
Last modification on: Wednesday, September 23, 2020 - 3:16:24 AM

File

article.pdf (produced by the author(s))

Licence


Distributed under a Creative Commons Attribution - ShareAlike 4.0 International License

Identifiers

  • HAL Id: hal-02154122, version 2

Citation

Thibault Clérice. Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin. Journal of Data Mining and Digital Humanities, Episciences.org, 2020. ⟨hal-02154122v2⟩
