Conference papers

Text Corpora and the Challenge of Newly Written Languages

Alice Millour 1 Karën Fort 1, 2
2 SEMAGRAMME - Semantic Analysis of Natural Language
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : Text corpora are the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of sufficient size remains a complex issue, especially when the corpus must be accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. We then present our own experiments with crowdsourcing text corpora and analyze the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we argue that efforts towards greater involvement of speakers should be pursued, especially when the language of interest is newly written.
Document type :
Conference papers

Cited literature [37 references]

https://hal.archives-ouvertes.fr/hal-02611209
Contributor : Alice Millour
Submitted on : Monday, May 18, 2020 - 11:44:22 AM
Last modification on : Wednesday, February 3, 2021 - 1:12:42 PM

File

ccurl2020_kfam.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02611209, version 1

Citation

Alice Millour, Karën Fort. Text Corpora and the Challenge of Newly Written Languages. 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), May 2020, Marseille, France. ⟨hal-02611209⟩
