Conference paper, 2022

Repeat after Me: Self-Supervised Learning of Acoustic-to-Articulatory Mapping by Vocal Imitation

Abstract

We propose a computational model of speech production combining: a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters; a DNN-based internal forward model predicting the sensory consequences of articulatory commands; and an internal inverse model, based on a recurrent neural network, recovering articulatory commands from the acoustic speech input. The forward and inverse models are jointly trained in a self-supervised way from raw, acoustic-only speech data from different speakers. The imitation simulations are evaluated both objectively and subjectively and show encouraging performance.
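To make the training scheme in the abstract concrete, here is a minimal sketch of the self-supervised imitation loop, assuming a fixed, pre-trained differentiable synthesizer and standard PyTorch modules. All class and function names (ForwardModel, InverseModel, synthesizer, imitation_step) are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the joint self-supervised imitation loop.
# `synthesizer` stands in for the pre-trained articulatory synthesizer
# described in the abstract: any fixed function mapping articulatory
# parameters to acoustic features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """DNN predicting the acoustic consequences of articulatory commands."""
    def __init__(self, art_dim: int, ac_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(art_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, ac_dim),
        )

    def forward(self, art):
        return self.net(art)

class InverseModel(nn.Module):
    """Recurrent network recovering articulatory commands from acoustics."""
    def __init__(self, ac_dim: int, art_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(ac_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, art_dim)

    def forward(self, ac):
        h, _ = self.rnn(ac)
        return self.out(h)

def imitation_step(inverse, forward_model, synthesizer, optimizer, ac_in):
    """One imitation step: hear a stimulus, infer articulatory commands,
    predict their sensory consequences, compare with the stimulus."""
    optimizer.zero_grad()
    art = inverse(ac_in)                 # inferred articulatory commands
    ac_pred = forward_model(art)         # predicted sensory consequences
    with torch.no_grad():
        ac_synth = synthesizer(art)      # actual consequences (fixed synth)
    # Self-supervised losses: (i) imitate the heard stimulus and
    # (ii) keep the forward model consistent with the synthesizer.
    loss = F.mse_loss(ac_pred, ac_in) + F.mse_loss(ac_pred, ac_synth)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A training loop would call `imitation_step` on batches of acoustic features of shape `[batch, time, ac_dim]`, holding the synthesizer fixed while the forward and inverse models are updated jointly, which mirrors the acoustic-only, self-supervised setup described above.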
Main file: ICASSP_FINAL.pdf (1.03 MB)
Origin: files produced by the author(s)

Dates and versions

hal-03688189, version 1 (2022-06-03)

Identifiers

HAL Id: hal-03688189 · DOI: 10.1109/ICASSP43922.2022.9747804

Cite

Marc-Antoine Georges, Julien Diard, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber. Repeat after Me: Self-Supervised Learning of Acoustic-to-Articulatory Mapping by Vocal Imitation. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing, May 2022, Singapore, Singapore. pp.8252-8256, ⟨10.1109/ICASSP43922.2022.9747804⟩. ⟨hal-03688189⟩