Speaker-Adaptive Acoustic-Articulatory Inversion using Cascaded Gaussian Mixture Regression

This paper addresses the adaptation of an acoustic-articulatory model of a reference speaker to the voice of another speaker, using a limited amount of audio-only data. In the context of pronunciation training, a virtual talking head displaying the internal speech articulators (e.g., the tongue) could be automatically animated by means of such a model using only the speaker's voice. In this study, the articulatory-acoustic relationship of the reference speaker is modeled by a Gaussian mixture model (GMM). To address the speaker adaptation problem, we propose a new framework called cascaded Gaussian mixture regression (C-GMR), and derive two implementations. The first one, referred to as Split-C-GMR, is a straightforward chaining of two distinct GMRs: one mapping the acoustic features of the source speaker into the acoustic space of the reference speaker, and the other estimating the articulatory trajectories with the reference model. In the second implementation, referred to as Integrated-C-GMR, the two mapping steps are tied together in a single probabilistic model. For this latter model, we present the full derivation of the exact EM training algorithm, which explicitly exploits the missing-data methodology of machine learning. Other adaptation schemes based on maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and direct cross-speaker acoustic-to-articulatory GMR are also investigated. Experiments conducted on two speakers with different amounts of adaptation data demonstrate the benefit of the proposed C-GMR techniques.
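To make the Split-C-GMR idea concrete, the following is a minimal sketch of Gaussian mixture regression and of the chaining of two such regressions. It is not the authors' implementation: the function names `gmr_predict` and `split_cascade`, the tuple layout of a model, and all toy parameter values are assumptions made for illustration, and the GMMs are taken as already trained. The sketch uses the standard GMR conditional-mean estimator and does not cover Integrated-C-GMR or its EM training.

```python
import numpy as np

def gmr_predict(x, weights, means, covs, dx):
    """Conditional mean E[y | x] under a joint GMM over z = [x; y].

    x: (dx,) input; weights: (K,); means: (K, dx+dy); covs: (K, dx+dy, dx+dy).
    """
    K = len(weights)
    log_resp = np.empty(K)
    cond_means = np.empty((K, means.shape[1] - dx))
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]   # input covariance block
        s_yx = covs[k, dx:, :dx]   # output/input cross-covariance block
        diff = x - mu_x
        sol = np.linalg.solve(s_xx, diff)  # s_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(s_xx)
        # Log of weight_k * N(x; mu_x, s_xx), the unnormalized responsibility.
        log_resp[k] = np.log(weights[k]) - 0.5 * (
            diff @ sol + logdet + dx * np.log(2.0 * np.pi)
        )
        # Per-component linear regression of y on x.
        cond_means[k] = mu_y + s_yx @ sol
    resp = np.exp(log_resp - log_resp.max())  # numerically stable normalization
    resp /= resp.sum()
    return resp @ cond_means

def split_cascade(x_src, src_to_ref, ref_to_artic):
    """Chain two GMRs: source acoustics -> reference acoustics -> articulation.

    Each model is a hypothetical tuple (weights, means, covs, dx).
    """
    x_ref = gmr_predict(x_src, *src_to_ref)
    return gmr_predict(x_ref, *ref_to_artic)

# Toy one-dimensional, single-component models (made-up values).
src_to_ref = (
    np.array([1.0]),
    np.array([[0.0, 0.0]]),
    np.array([[[1.0, 0.5], [0.5, 1.0]]]),
    1,
)
ref_to_artic = (
    np.array([1.0]),
    np.array([[0.0, 1.0]]),
    np.array([[[1.0, -0.5], [-0.5, 1.0]]]),
    1,
)
articulatory_estimate = split_cascade(np.array([2.0]), src_to_ref, ref_to_artic)
```

In this sketch each frame is mapped independently, and the first regression passes only its point estimate to the second, which is what makes the cascade "split" rather than a single joint probabilistic model.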

Keywords

Acoustic-articulatory inversion; EM algorithm; Gaussian mixture regression; pronunciation training; speaker adaptation; speech production; talking head

Domains

Machine Learning [stat.ML]; Signal and Image Processing [eess.SP]


https://hal.science/hal-01231197

Submitted on: Thursday, November 19, 2015, 16:25:12

Last modified on: Thursday, April 4, 2024, 21:25:10

Dates and versions

hal-01231197, version 1 (19-11-2015)

Identifiers

HAL Id: hal-01231197, version 1
DOI : 10.1109/TASLP.2015.2464702

Cite

Thomas Hueber, Laurent Girin, Xavier Alameda-Pineda, Gérard Bailly. Speaker-Adaptive Acoustic-Articulatory Inversion using Cascaded Gaussian Mixture Regression. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015, 23 (12), pp.2246-2259. ⟨10.1109/TASLP.2015.2464702⟩. ⟨hal-01231197⟩


