Languages(s) of the SHUN-PAO, a Computational Linguistics account

Pierre Magistry

Conference paper. Year: 2019

Role: Author
PersonId : 12984
IdHAL : pierre-magistry
IdRef : 177448148

Institut de recherches Asiatiques

Abstract

This work is part of a broader project that requires adapting information extraction (IE) methods to written materials (mostly press articles) published in China between the mid-19th and mid-20th centuries. This calls for a better understanding and description of the language(s) observable in our sources. More importantly, it is an unprecedented opportunity to provide a usage-based description of written languages as used in the press in Modern China. There is an abundant literature describing this pivotal era from different perspectives and language-related disciplines, including the history of language policies (Kaske, 2008), sociolinguistic aspects (Weng, 2018), and historical linguistics (Coblin, 2000; Simmons, 2017). However, what is presented in this article is, as far as I know, the first usage-based study to leverage a complete corpus of almost 80 years of a daily newspaper, the Shen-Pao (申報), containing about 750 million sinograms, to account for actual practices and their evolution through time. To do so, I propose new Computational Linguistics methods and tools inspired by recent work in the field, especially Language Modeling and Contextual String Embeddings.

Keywords

Language change; Language Modeling; Lexical statistics; Contextual Embeddings; Modern China

Domains

Linguistics; Document and text processing

Main file

61_Magistry_DADH2019.pdf (917.58 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-02493546

Submitted on: Thursday, 27 February 2020, 21:09:39

Last modified on: Wednesday, 31 January 2024, 14:06:05

Dates and versions

hal-02493546 , version 1 (27-02-2020)

Identifiers

HAL Id: hal-02493546, version 1

Cite

Pierre Magistry. Languages(s) of the SHUN-PAO, a Computational Linguistics account. 10th International Conference of Digital Archives and Digital Humanities, Dec 2019, Taipei, Taiwan. ⟨hal-02493546⟩


Collections

CNRS UNIV-AMU CAMPUS-AAR AAI IRASIA ASIES_ET_PACIFIQUE
