Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice

Yann Teytaut; Axel Roebel

doi:10.21437/interspeech.2021-1676

Conference paper. Year: 2021

Yann Teytaut

Role: Author
PersonId: 1125039

Analyse et synthèse sonores [Paris]

Axel Roebel

Role: Author
PersonId: 4527
IdHAL: axel-roebel
ORCID: 0000-0001-6136-4391
IdRef: 227186079

Analyse et synthèse sonores [Paris]

Abstract

Phoneme-to-audio alignment is the task of synchronizing voice recordings with their phonetic transcripts. In this work, we introduce a new system for forced phonetic alignment based on Recurrent Neural Networks (RNN). With the Connectionist Temporal Classification (CTC) loss as the training objective, and an additional reconstruction cost, we learn to infer relevant per-frame phoneme probabilities from which the alignment is derived. The core of the neural architecture is a context-aware attention mechanism between mel-spectrograms and side information. We investigate two contexts, given by either phoneme sequences (model PHATT) or the spectrograms themselves (model SPATT). Evaluations show that these models produce precise alignments for both speaking and singing voice. The best results are obtained with the model PHATT, which outperforms the baseline reference with an average imprecision of 16.3 ms and 29.8 ms on speech and singing, respectively. The model SPATT also appears as an interesting alternative, capable of aligning longer audio files without requiring phoneme sequences on small audio segments.
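As an illustration of how an alignment can be derived from per-frame phoneme probabilities, the sketch below runs a generic monotonic Viterbi search over a matrix of log-probabilities. It is not the decoding procedure of the paper, which the abstract does not specify. The function names `forced_align` and `phoneme_onsets`, the hop size, and the toy probabilities are assumptions made for this example, and the CTC blank symbol is left out for brevity.

```python
import math


def forced_align(log_probs, phonemes):
    """Best monotonic assignment of frames to transcript positions.

    log_probs: T frames, each a list of log-probabilities indexed by phoneme id.
    phonemes:  N phoneme ids in transcript order (requires 1 <= N <= T).
    Returns a list of T transcript positions that starts at 0, ends at N - 1,
    and either stays or advances by one position at each frame.
    """
    T, N = len(log_probs), len(phonemes)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = -math.inf
    score = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][phonemes[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                best, back[t][n] = move, n - 1
            else:
                best, back[t][n] = stay, n
            if best > neg_inf:
                score[t][n] = best + log_probs[t][phonemes[n]]
    # Backtrack from the last transcript position at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def phoneme_onsets(path, hop_seconds):
    """Onset time in seconds of each transcript position along the path."""
    onsets = [0.0]
    for t in range(1, len(path)):
        if path[t] != path[t - 1]:
            onsets.append(t * hop_seconds)
    return onsets


if __name__ == "__main__":
    # Toy posteriors for 6 frames and 3 phoneme ids (made-up values).
    probs = [
        [0.90, 0.05, 0.05],
        [0.80, 0.10, 0.10],
        [0.10, 0.80, 0.10],
        [0.10, 0.70, 0.20],
        [0.10, 0.20, 0.70],
        [0.05, 0.05, 0.90],
    ]
    log_probs = [[math.log(p) for p in frame] for frame in probs]
    path = forced_align(log_probs, [0, 1, 2])
    print(path)  # [0, 0, 1, 1, 2, 2]
    print(phoneme_onsets(path, hop_seconds=0.01))
```

In this toy case each phoneme is assigned two consecutive frames, so with an assumed 10 ms hop the onsets fall at roughly 0, 20 and 40 ms. Comparing such predicted onsets with annotated ones is one way a boundary error in milliseconds could be measured.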

Keywords

phoneme-to-audio alignment, recurrent neural network, Connectionist Temporal Classification, voice analysis

Domains

Computation and Language [cs.CL]

Main file

1676anav.pdf (421.96 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03552964

Submitted on: Tuesday, February 15, 2022, 17:17:27

Last modified on: Saturday, October 7, 2023, 21:36:22

Long-term archiving on: Monday, May 16, 2022, 19:07:53

Dates and versions

hal-03552964, version 1 (15-02-2022)

Identifiers

HAL Id: hal-03552964, version 1
DOI: 10.21437/interspeech.2021-1676

Cite

Yann Teytaut, Axel Roebel. Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice. Proceedings of Interspeech 2021, International Speech Communication Association, Aug 2021, Brno, Czech Republic. pp.61-65, ⟨10.21437/interspeech.2021-1676⟩. ⟨hal-03552964⟩

Collections

CNRS IRCAM STMS SORBONNE-UNIVERSITE SU-SCIENCES ANR

332 views

830 downloads