Predicting F0 and voicing from NAM-captured whispered speech

Viet-Anh Tran; Gérard Bailly; Hélène Loevenbruck; Tomoki Toda

Conference paper. Year: 2008

Viet-Anh Tran

Role: Author
PersonId : 907457

Grenoble Images Parole Signal Automatique

Gérard Bailly

Role: Author
PersonId : 444
IdHAL : gerard-bailly
ORCID : 0000-0002-6053-0818
IdRef : 033792135

GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face

Hélène Loevenbruck

Role: Author
PersonId : 1957
IdHAL : helene-loevenbruck
ORCID : 0000-0003-4015-3565
IdRef : 099327627

GIPSA - Parole, Multimodalité, Développement

Tomoki Toda

Role: Author

Speech and Acoustics Processing Laboratory

Abstract

The NAM-to-speech conversion proposed by Toda and colleagues, which converts Non-Audible Murmur (NAM) into audible speech through a statistical mapping trained on aligned corpora, is a very promising technique. Its performance is still insufficient, however, mainly because the F0 of the converted voice is difficult to estimate from unvoiced speech. In this paper, we propose a method to improve F0 estimation and the voicing decision in a NAM-to-speech conversion system based on Gaussian Mixture Models (GMMs), applied to whispered speech. Instead of combining the voicing decision and F0 estimation in a single GMM, a simple feed-forward neural network detects voiced segments in the whisper, while a GMM trained on voiced segments estimates a continuous melodic contour. The voiced/unvoiced decision error rate of the network is 6.8%, compared with 9.2% for the original system. The proposed approach also improves F0 estimation.
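The two-stage idea in the abstract (a voicing classifier first, then a GMM that predicts F0 only on frames judged voiced) can be illustrated with the minimal NumPy sketch below. It is not the authors' implementation. Assumptions: the input is a matrix of per-frame spectral features; the network has one `tanh` hidden layer and a sigmoid output thresholded at 0.5; its weights, like the joint GMM over `[features, F0]`, are assumed to have been trained elsewhere. The GMM step uses the frame-by-frame minimum mean-square-error mapping E[F0 | x]; the original system may instead use a trajectory-based variant. The helpers `mlp_voicing`, `gmm_map_f0`, `predict_f0`, `vuv_error_rate` and `f0_rmse` are hypothetical names. The two metrics are also assumed definitions: the frame-level label mismatch rate, and the RMSE over frames voiced in both tracks.

```python
import numpy as np


def mlp_voicing(x, w1, b1, w2, b2, threshold=0.5):
    """Frame-wise voiced/unvoiced decision with a one-hidden-layer feed-forward net.

    x: (T, D) features, one row per frame. Weights are assumed pre-trained.
    Returns a boolean array of shape (T,), True meaning "voiced".
    """
    h = np.tanh(x @ w1 + b1)                      # hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # sigmoid output = P(voiced)
    return p.ravel() > threshold


def log_gauss(x, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of x."""
    d = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    k = x.shape[1]
    return -0.5 * (np.einsum("ti,ij,tj->t", d, inv, d) + logdet + k * np.log(2 * np.pi))


def gmm_map_f0(x, weights, means, covs):
    """MMSE mapping E[y | x] under a joint GMM on z = [x, y], where y is F0.

    weights: (M,), means: (M, D + 1), covs: (M, D + 1, D + 1); last dim is F0.
    """
    T, D = x.shape
    M = len(weights)
    log_post = np.empty((T, M))
    cond = np.empty((T, M))
    for m in range(M):
        mx, my = means[m, :D], means[m, D]
        sxx, syx = covs[m, :D, :D], covs[m, D, :D]
        # unnormalised log posterior of component m given x
        log_post[:, m] = np.log(weights[m]) + log_gauss(x, mx, sxx)
        # per-component conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x)
        cond[:, m] = my + (x - mx) @ np.linalg.solve(sxx, syx)
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return (post * cond).sum(axis=1)


def predict_f0(x, net, gmm):
    """Gate the GMM F0 regression with the network's voicing decision."""
    voiced = mlp_voicing(x, *net)
    f0 = np.zeros(x.shape[0])                     # unvoiced frames keep F0 = 0
    if voiced.any():
        f0[voiced] = gmm_map_f0(x[voiced], *gmm)
    return f0, voiced


def vuv_error_rate(pred_voiced, ref_voiced):
    """Fraction of frames whose voicing label differs from the reference."""
    return float(np.mean(np.asarray(pred_voiced) != np.asarray(ref_voiced)))


def f0_rmse(pred_f0, ref_f0, pred_voiced, ref_voiced):
    """F0 RMSE over frames that are voiced in both prediction and reference."""
    both = np.asarray(pred_voiced) & np.asarray(ref_voiced)
    if not both.any():
        return float("nan")
    d = np.asarray(pred_f0, dtype=float)[both] - np.asarray(ref_f0, dtype=float)[both]
    return float(np.sqrt(np.mean(d ** 2)))
```

Because the classifier and the regressor are separate, the GMM only has to model voiced frames. One plausible benefit is that its continuous F0 output is not pulled toward the unvoiced region. The voicing decision can then be tuned independently, for example by moving `threshold`.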

Subject area

Signal and Image Processing [eess.SP]

Main file

vat_SP08.pdf (739.81 KB)

Origin: files produced by the author(s)


https://hal.science/hal-00333290

Submitted: Wednesday, 22 October 2008, 20:23:50

Last modified: Thursday, 4 April 2024, 21:18:17

Long-term archiving: Monday, 7 June 2010, 21:22:02

Dates and versions

hal-00333290, version 1 (22-10-2008)

Identifiers

HAL Id: hal-00333290, version 1

Cite

Viet-Anh Tran, Gérard Bailly, Hélène Loevenbruck, Tomoki Toda. Predicting F0 and voicing from NAM-captured whispered speech. Speech Prosody 2008 - 4th International Conference on Speech Prosody, May 2008, Campinas, Brazil. pp.107-110. ⟨hal-00333290⟩

Collections

UGA CNRS GIPSA GIPSA-DPC GIPSA-MPACIF GIPSA-PMD
