Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract

Diandra Fabre (1), Thomas Hueber (1), Laurent Girin (1, 2), Xavier Alameda-Pineda (3, 2), Pierre Badin (1)
(1) GIPSA-DPC - Département Parole et Cognition
(2) PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract: Visual biofeedback is the process of gaining awareness of physiological functions through the display of visual information. In the case of speech, visual biofeedback usually consists of showing speakers their own articulatory movements, which has proven useful in applications such as speech therapy and second language learning. This article presents a novel method for automatically animating an articulatory tongue model from ultrasound images. Integrating this model into a virtual talking head overcomes the limitations of displaying raw ultrasound images and provides more complete and user-friendly feedback, showing not only the tongue but also the palate, teeth, pharynx, etc. Together, these cues are expected to make tongue movements easier to understand. Our approach is based on a probabilistic model that converts raw ultrasound images of the vocal tract into control parameters of the articulatory tongue model. We investigated several mapping techniques, including Gaussian Mixture Regression (GMR) and, in particular, Cascaded Gaussian Mixture Regression (C-GMR), the latter recently proposed in the context of acoustic-articulatory inversion. Both techniques are evaluated on a multispeaker database. The C-GMR adapts a reference GMR model, trained on a large dataset of multimodal articulatory data from a reference speaker, to a new source speaker using a small set of adaptation data recorded during a preliminary enrollment session (system calibration). By exploiting prior information from the reference model, the C-GMR approach is able (i) to maintain good mapping performance while minimizing the amount of adaptation data (and thus the duration of the enrollment session), and (ii) to generalize better than the GMR approach to articulatory configurations not seen during enrollment. As a result, the C-GMR appears to be a good mapping technique for a practical visual biofeedback system.
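To make the GMR mapping step more concrete, here is a minimal sketch of plain Gaussian Mixture Regression. It is not the authors' implementation, and it does not cover the C-GMR adaptation. The sketch assumes that a joint Gaussian mixture over stacked vectors z = [x; y] has already been trained (e.g., by EM), where x is a low-dimensional feature vector extracted from an ultrasound image and y holds the tongue-model control parameters. The feature extraction and the names `gaussian_pdf` and `gmr_predict` are hypothetical. The prediction is the conditional expectation E[y | x] = Σ_k h_k(x) (μ_y,k + Σ_yx,k Σ_xx,k⁻¹ (x − μ_x,k)), where h_k(x) is the posterior probability of component k given x.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of the multivariate normal N(mean, cov) evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    # Solve a linear system instead of inverting cov explicitly (more stable).
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def gmr_predict(x, weights, means, covs, dx):
    """Hypothetical helper: E[y | x] under a joint GMM on z = [x; y].

    weights: (K,) mixture weights
    means:   (K, dx + dy) joint means
    covs:    (K, dx + dy, dx + dy) joint covariances
    dx:      dimension of the input part x (e.g. ultrasound image features)
    """
    K = len(weights)
    resp = np.empty(K)
    cond_means = []
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]
        s_yx = covs[k, dx:, :dx]
        # Unnormalized posterior probability of component k given x.
        resp[k] = weights[k] * gaussian_pdf(x, mu_x, s_xx)
        # Per-component linear regression of y on x.
        cond_means.append(mu_y + s_yx @ np.linalg.solve(s_xx, x - mu_x))
    resp /= resp.sum()
    # Posterior-weighted sum of the per-component conditional means.
    return sum(r * m for r, m in zip(resp, cond_means))


# Toy 2-component model with 1-D input and 1-D output (illustrative values only).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 5.0]])
covs = np.array([[[1.0, 0.5], [0.5, 1.0]],
                 [[1.0, -0.5], [-0.5, 1.0]]])
print(gmr_predict(np.array([5.0]), weights, means, covs, dx=1))  # → [5.]
```

At x = 5 both components are equally likely, so the output is the average of their local regressions (2.5 and 7.5). Near x = 0 the first component dominates, and the mapping follows its own linear regression. In a real system, x would be much higher-dimensional, and the covariance blocks would need regularization.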
Document type: Journal article
Speech Communication, Elsevier: North-Holland, 2017, 93, pp. 63-75. DOI: 10.1016/j.specom.2017.08.002
Contributor: Thomas Hueber
Submitted on: Tuesday, August 29, 2017 - 07:38:45
Last modified on: Friday, September 8, 2017 - 15:53:33



Diandra Fabre, Thomas Hueber, Laurent Girin, Xavier Alameda-Pineda, Pierre Badin. Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract. Speech Communication, Elsevier: North-Holland, 2017, 93, pp. 63-75. DOI: 10.1016/j.specom.2017.08.002. HAL: hal-01578315.


