Adaptive Latency for Part-of-Speech Tagging in Incremental Text-to-Speech Synthesis

Maël Pouget; Olha Nahorna; Thomas Hueber; Gérard Bailly

doi:10.21437/Interspeech.2016-165

Conference paper, Year: 2016

Maël Pouget

Role: Author

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Olha Nahorna

Role: Author
PersonId : 915695

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Thomas Hueber

Role: Author
PersonId : 5965
IdHAL : thomas-hueber
ORCID : 0000-0002-8296-5177
IdRef : 143151568

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Gérard Bailly

Role: Author
PersonId : 444
IdHAL : gerard-bailly
ORCID : 0000-0002-6053-0818
IdRef : 033792135

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Abstract

Incremental text-to-speech systems aim to synthesize a text 'on-the-fly', while the user is typing a sentence. In this context, this article addresses the problem of part-of-speech (POS) tagging, i.e. assigning each word its lexical category, which is a critical step for accurate grapheme-to-phoneme conversion and prosody estimation. Here, the main challenge is to estimate the POS of a given word without knowing its 'right context' (i.e. the following words, which are not available yet). To address this issue, we propose a method based on a set of decision trees that estimate online whether a given POS tag is likely to be modified when more right-contextual information becomes available. In such a case, the synthesis is delayed until POS stability is guaranteed. As a result, the synthetic voice is delivered in word chunks of variable length. Objective evaluation on French shows that the proposed method is able to estimate POS tags with more than 92% accuracy (compared to a non-incremental system) while minimizing the synthesis latency (between 1 and 4 words). Perceptual evaluation (ranking test) is then carried out in the context of HMM-based speech synthesis. Experimental results show that the word grouping resulting from the proposed method is rated more acceptable than word-by-word incremental synthesis.
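
The delayed-release idea described in the abstract can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon, the function names (toy_tagger, toy_is_stable, release_chunks), the tagging rule, and the max_pending cap are all hypothetical, and a hand-written rule stands in for the trained decision trees.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical toy lexicon: each word maps to its candidate POS tags.
LEXICON = {
    "il": ("PRO",),
    "la": ("DET", "PRO"),
    "porte": ("NOUN", "VERB"),
    "vite": ("ADV",),
}


def toy_tagger(prefix: Sequence[str]) -> List[str]:
    """Tag a sentence prefix; ambiguous words are revised once a next word exists."""
    tags = []
    for i, word in enumerate(prefix):
        candidates = LEXICON.get(word, ("NOUN",))
        has_right_context = i + 1 < len(prefix)
        # Toy rule (assumption): first reading without right context, last with it.
        tags.append(candidates[-1] if has_right_context else candidates[0])
    return tags


def toy_is_stable(prefix: Sequence[str], tags: Sequence[str], i: int) -> bool:
    """Stand-in for the decision trees: is tag i unlikely to change later?"""
    unambiguous = len(LEXICON.get(prefix[i], ("NOUN",))) == 1
    return unambiguous or i + 1 < len(prefix)


def release_chunks(
    words: Sequence[str],
    tag_prefix: Callable[[Sequence[str]], List[str]],
    is_stable: Callable[[Sequence[str], Sequence[str], int], bool],
    max_pending: int = 4,  # assumed cap on how many words may wait
) -> Iterator[List[Tuple[str, str]]]:
    """Yield (word, tag) chunks, holding words back while their tag may change."""
    released = 0  # number of words already handed to synthesis
    for end in range(1, len(words) + 1):
        prefix = words[:end]
        tags = tag_prefix(prefix)  # re-tag the whole prefix on each new word
        cut = released
        # Extend the chunk over the leading run of pending words judged stable.
        while cut < end and is_stable(prefix, tags, cut):
            cut += 1
        # Flush everything at the end of the sentence or when the cap is hit.
        if end == len(words) or end - released >= max_pending:
            cut = end
        if cut > released:
            yield list(zip(prefix[released:cut], tags[released:cut]))
            released = cut


chunks = list(release_chunks(["il", "la", "porte", "vite"], toy_tagger, toy_is_stable))
# chunks == [[("il", "PRO")], [("la", "PRO")], [("porte", "VERB"), ("vite", "ADV")]]
```

In this toy run, "la" is held back until "porte" arrives, at which point its provisional DET tag has been revised to PRO, so the chunks handed to synthesis vary in length from one word to two.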

Keywords

Incremental speech synthesis; natural language processing; TTS; classification; part-of-speech

Domains

Signal and Image Processing [eess.SP]; Machine Learning [stat.ML]

Main file

interspeech-2016-itts (1).pdf (260.07 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-01374782

Submitted on: Saturday, October 1, 2016, 12:31:33

Last modified on: Thursday, April 4, 2024, 18:21:46

Long-term archiving on: Monday, January 2, 2017, 12:52:06

Dates and versions

hal-01374782 , version 1 (01-10-2016)

Identifiers

HAL Id : hal-01374782 , version 1
DOI : 10.21437/Interspeech.2016-165

Cite

Maël Pouget, Olha Nahorna, Thomas Hueber, Gérard Bailly. Adaptive Latency for Part-of-Speech Tagging in Incremental Text-to-Speech Synthesis. Interspeech 2016 - 17th Annual Conference of the International Speech Communication Association, Sep 2016, San Francisco, CA, United States. pp. 2846-2850, ⟨10.21437/Interspeech.2016-165⟩. ⟨hal-01374782⟩

Collections

UGA CNRS GIPSA GIPSA-DPC GIPSA-CRISSP

344 views

168 downloads
