
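Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement [CamemBERT contextual language models for French: impact of the size and heterogeneity of the training data]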

Abstract : Contextual word embeddings have become ubiquitous in Natural Language Processing. Until recently, most available models were trained on English data or on the concatenation of corpora in multiple languages. This made the practical use of models in all languages except English very limited. The recent release of monolingual versions of BERT (Devlin et al., 2019) for French established a new state-of-the-art for all evaluated tasks. In this paper, based on experiments on CamemBERT (Martin et al., 2019), we show that pretraining such models on highly variable datasets leads to better downstream performance compared to models trained on more uniform data. Moreover, we show that a relatively small amount of web crawled data (4GB) leads to downstream performances as good as a model pretrained on a corpus two orders of magnitude larger (138GB).
Document type :
Conference papers

Cited literature [44 references]

https://hal.archives-ouvertes.fr/hal-02784755
Contributor : Sylvain Pogodalla
Submitted on : Tuesday, June 23, 2020 - 11:59:03 AM
Last modification on : Tuesday, March 9, 2021 - 8:42:46 AM

File

151.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-02784755, version 3

Citation

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoan Dupont, Laurent Romary, et al.. Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement. JEP-TALN-RECITAL 2020 - 33ème Journées d’Études sur la Parole, 27ème Conférence sur le Traitement Automatique des Langues Naturelles, 22ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, Jun 2020, Nancy / Virtuel, France. pp.54-65. ⟨hal-02784755v3⟩

Metrics

  • Record views : 291
  • File downloads : 1042