Conference paper, Year: 2019

Languages(s) of the SHUN-PAO, a Computational Linguistics account

Pierre Magistry

Abstract

This work is part of a broader project that requires adapting information extraction (IE) methods to written materials (mostly press articles) published in China between the mid-19th and mid-20th centuries. This calls for a better understanding and description of the language(s) observable in our sources. More importantly, it is an unprecedented opportunity to provide a usage-based description of the written languages used in the press in Modern China. There is an abundant literature describing this pivotal era from different perspectives and disciplines related to language, including the history of language policies (Kaske, 2008), sociolinguistic aspects (Weng, 2018), and historical linguistics (Coblin, 2000; Simmons, 2017). However, what is presented in this article is, as far as I know, the first usage-based study to leverage a complete corpus of almost 80 years of a daily newspaper, the Shen-Pao (申報), containing about 750 million sinograms, to account for actual practices and their evolution through time. To do so, I propose new Computational Linguistics methods and tools inspired by recent work in the field, especially Language Modeling and Contextual String Embeddings.
Main file
61_Magistry_DADH2019.pdf (917.58 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02493546, version 1 (27-02-2020)

Identifiers

  • HAL Id: hal-02493546, version 1

Cite

Pierre Magistry. Languages(s) of the SHUN-PAO, a Computational Linguistics account. 10th International Conference of Digital Archives and Digital Humanities, Dec 2019, Taipei, Taiwan. ⟨hal-02493546⟩