ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks

We aim to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR to 350,000 hours of diverse TV shows. From this text, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) are shared with the community and evaluated on three downstream tasks: spoken language understanding, classification of TV shows, and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to the initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be used to build spoken language models.

Domains

Computer Science [cs] Artificial Intelligence [cs.AI] Machine Learning [cs.LG] Document and Text Processing Signal and Image Processing [eess.SP]

Main file

2207.01893.pdf (183.55 KB)

Origin: Files produced by the author(s)

Contributor: Valentin Pelloin

https://hal.science/hal-03770506

Submitted on: Tuesday, September 6, 2022 - 15:07:15

Last modified on: Friday, March 22, 2024 - 18:24:04

Long-term archiving on: Wednesday, December 7, 2022 - 18:45:21

Dates and versions

hal-03770506, version 1 (06-09-2022)

Identifiers

HAL Id: hal-03770506, version 1
arXiv: 2207.01893

Cite

Valentin Pelloin, Franck Dary, Nicolas Hervé, Benoît Favre, Nathalie Camelin, et al. ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks. Interspeech 2022, Sep 2022, Incheon, South Korea. ⟨hal-03770506⟩


Collections

UNIV-TLN CNRS UNIV-AMU UNIV-LEMANS LIUM LIUM-LST LIS-LAB ANR INCIAM
