Phone-Level Embeddings for Unit Selection Speech Synthesis

Antoine Perquin 1 Gwénolé Lecorvé 1 Damien Lolive 1 Laurent Amsaleg 2
1 EXPRESSION - Expressiveness in Human Centered Data/Media
UBS - Université de Bretagne Sud, IRISA-D6 - MEDIA ET INTERACTIONS
2 LinkMedia - Creating and exploiting explicit links between multimedia fragments
Inria Rennes – Bretagne Atlantique, IRISA_D6 - MEDIA ET INTERACTIONS
Abstract: Deep neural networks have become the state of the art in speech synthesis. They have been used to predict signal parameters directly or to provide unsupervised descriptions of speech segments through embeddings. In this paper, we present four models, two of which allow us to extract phone-level embeddings for unit selection speech synthesis. Three of the models rely on a feed-forward DNN and the fourth on an LSTM. The resulting embeddings make it possible to replace the usual expert-based target costs with a Euclidean distance in the embedding space. This work is conducted on a French corpus consisting of an 11-hour audiobook. Perceptual tests show that the produced speech is preferred over that of a unit selection method whose target cost is defined by an expert. They also show that the embeddings are general enough to be used for different speech styles without loss of quality. Furthermore, objective measures and a perceptual test on statistical parametric speech synthesis show that our models perform comparably to state-of-the-art models for parametric signal generation, in spite of necessary simplifications, namely late time integration and information compression.
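As an illustration of the idea stated in the abstract, a target cost defined as a Euclidean distance in embedding space could be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names, the unit identifiers, and the two-dimensional toy embeddings are all hypothetical, and the sketch covers only the target cost, not the rest of a unit selection system.

```python
import math


def embedding_target_cost(target_embedding, candidate_embedding):
    """Euclidean distance between two phone-level embeddings.

    Hypothetical helper: the vectors stand in for embeddings that a
    trained model would produce; none are computed here.
    """
    return math.dist(target_embedding, candidate_embedding)


def rank_candidates(target_embedding, candidates):
    """Sort candidate units by increasing target cost.

    `candidates` maps a (hypothetical) unit identifier to its embedding.
    Returns a list of (unit_id, cost) pairs, cheapest first.
    """
    costs = [
        (unit_id, embedding_target_cost(target_embedding, embedding))
        for unit_id, embedding in candidates.items()
    ]
    return sorted(costs, key=lambda pair: pair[1])


# Toy data, invented for illustration only.
target = [0.0, 0.0]
candidates = {
    "unit_a": [3.0, 4.0],
    "unit_b": [1.0, 0.0],
    "unit_c": [0.0, 0.0],
}
ranked = rank_candidates(target, candidates)
```

In this toy setting the candidate whose embedding lies closest to the target embedding receives the lowest cost, which is the role an expert-defined target cost would otherwise play.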
Document type:
Conference paper
SLSP 2018 - 6th International Conference on Statistical Language and Speech Processing, Oct 2018, Mons, Belgium. pp.1-11
Contributor: Antoine Perquin
Submitted on: Monday, 16 July 2018 - 17:05:05
Last modified on: Friday, 7 September 2018 - 09:23:58
Document(s) archived on: Wednesday, 17 October 2018 - 16:24:54


Files produced by the author(s)


  • HAL Id: hal-01840812, version 1


Antoine Perquin, Gwénolé Lecorvé, Damien Lolive, Laurent Amsaleg. Phone-Level Embeddings for Unit Selection Speech Synthesis. SLSP 2018 - 6th International Conference on Statistical Language and Speech Processing, Oct 2018, Mons, Belgium. pp.1-11. 〈hal-01840812〉


