Conference papers

Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice

Yann Teytaut 1, Axel Roebel 1
1 Analyse et synthèse sonores [Paris]
STMS - Sciences et Technologies de la Musique et du Son
Abstract : Phoneme-to-audio alignment is the task of synchronizing voice recordings with their phonetic transcripts. In this work, we introduce a new system for forced phonetic alignment based on Recurrent Neural Networks (RNN). With the Connectionist Temporal Classification (CTC) loss as the training objective, and an additional reconstruction cost, we learn to infer relevant per-frame phoneme probabilities from which the alignment is derived. The core of the neural architecture is a context-aware attention mechanism between mel-spectrograms and side information. We investigate two contexts, given by either phoneme sequences (model PHATT) or the spectrograms themselves (model SPATT). Evaluations show that these models produce precise alignments for both speaking and singing voice. The best results are obtained with the model PHATT, which outperforms the baseline reference with an average imprecision of 16.3 ms on speech and 29.8 ms on singing. The model SPATT also appears as an interesting alternative, capable of aligning longer audio files without requiring phoneme sequences on small audio segments.
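To make the step "per-frame phoneme probabilities from which alignment is derived" concrete, here is a minimal sketch of a monotonic Viterbi alignment in Python. It is not the authors' decoding procedure: it ignores the CTC blank symbol, assumes every phoneme of the transcript occupies at least one frame, and uses hypothetical names (`forced_align`, `to_segments`) and toy probabilities chosen only for illustration.

```python
import math


def forced_align(log_probs, phonemes):
    """For each frame, return the index into `phonemes` it is aligned to.

    log_probs: list of frames, each a list of log-probabilities per class.
    phonemes:  transcript as a list of class indices (assumed input format).
    """
    T, N = len(log_probs), len(phonemes)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phoneme")
    NEG = -math.inf
    # score[t][n]: best log-score of a monotonic path ending at frame t in phoneme n
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][phonemes[0]]
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1][n]                      # remain in the same phoneme
            move = score[t - 1][n - 1] if n > 0 else NEG  # advance to the next phoneme
            if move > stay:
                best, back[t][n] = move, n - 1
            else:
                best, back[t][n] = stay, n
            if best > NEG:
                score[t][n] = best + log_probs[t][phonemes[n]]
    # Backtrack from the last phoneme at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def to_segments(path, hop):
    """Group a frame-level path into (phoneme_index, start, end) segments.

    `hop` is the frame hop in whatever time unit the caller wants (assumption).
    """
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[t - 1]:
            segments.append((path[start], start * hop, t * hop))
            start = t
    return segments


# Toy example: 5 frames, 3 phoneme classes, transcript = classes [0, 2, 1].
probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
    [0.1, 0.8, 0.1],
]
frames = [[math.log(p) for p in frame] for frame in probs]
path = forced_align(frames, [0, 2, 1])   # [0, 0, 1, 1, 2]
segments = to_segments(path, 1)          # [(0, 0, 2), (1, 2, 4), (2, 4, 5)]
```

With a real model, `frames` would be replaced by the network's per-frame log-posteriors and `hop` by the spectrogram hop size, so that the segment boundaries are expressed in seconds.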
Document type : Conference papers
Contributor : Yann TEYTAUT
Submitted on : Tuesday, February 15, 2022 - 5:17:27 PM
Last modification on : Tuesday, March 15, 2022 - 3:33:21 AM
Long-term archiving on : Monday, May 16, 2022 - 7:07:53 PM


Files produced by the author(s)



Yann Teytaut, Axel Roebel. Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice. Proceedings of Interspeech 2021, International Speech Communication Association, Aug 2021, Brno, Czech Republic. pp.61-65, ⟨10.21437/interspeech.2021-1676⟩. ⟨hal-03552964⟩


