Using ASR-Generated Text for Spoken Language Modeling

This paper aims to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19 GB of text after applying ASR to 350,000 hours of diverse TV shows. From this text, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) are shared with the community and are evaluated not only in terms of word prediction accuracy but also on two downstream tasks: classification of TV shows and syntactic parsing of speech. Experimental results show that FlauBERT-Oral is better than the initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be useful for improving spoken language modeling.

Domains

Computer Science [cs], Artificial Intelligence [cs.AI], Machine Learning [cs.LG], Document and Text Processing, Signal and Image Processing [eess.SP]

Main file

2022.bigscience-1.2.pdf (210.59 KB)

Origin: Publisher files allowed on an open archive

Contributor: Valentin Pelloin

https://hal.science/hal-03770460

Submitted on: Tuesday, September 6, 2022, 14:37:20

Last modified on: Friday, March 22, 2024, 18:24:04

Long-term archiving on: Wednesday, December 7, 2022, 18:42:50

Dates and versions

hal-03770460, version 1 (06-09-2022)

Identifiers

HAL Id: hal-03770460, version 1
DOI: 10.18653/v1/2022.bigscience-1.2

Cite

Nicolas Hervé, Valentin Pelloin, Benoît Favre, Franck Dary, Antoine Laurent, et al. Using ASR-Generated Text for Spoken Language Modeling. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, May 2022, virtual+Dublin, Ireland. pp.17-25, ⟨10.18653/v1/2022.bigscience-1.2⟩. ⟨hal-03770460⟩

Collections

UNIV-TLN CNRS UNIV-AMU UNIV-LEMANS LIUM LIUM-LST LIS-LAB ANR INCIAM
