Journal articles

A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice

Frederik Bous 1 Axel Roebel 1 
1 Analyse et synthèse sonores [Paris]
STMS - Sciences et Technologies de la Musique et du Son
Abstract : In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. The f0 transformation is performed by training an auto-encoder on the voice signal’s mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to the auto-encoder to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if the requested f0 falls outside the range of the source speaker/singer. Using the mean f0 error in the transformed mel-spectrograms, we define a disentanglement measure and perform a study over the required bottleneck size. The study reveals that to remove the f0 from the auto-encoder’s latent code, the bottleneck size should be smaller than four for singing and smaller than nine for speech. Through a perceptual test, we compare the audio quality of the proposed auto-encoder to f0 transformations obtained with a classical vocoder. The perceptual test confirms that the audio quality is better for the auto-encoder than for the classical vocoder. Finally, a visual analysis of the latent code for the two-dimensional case is carried out. We observe that the auto-encoder encodes phonemes as repeated discontinuous temporal gestures within the latent code.
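The abstract builds its disentanglement measure on the mean f0 error of the transformed output but does not give the formula. The following is only a minimal sketch of one plausible reading, assuming the error is the mean absolute deviation in cents between the requested f0 and the f0 re-estimated from the transformed signal, taken over frames that are voiced in both tracks. The function name `mean_f0_error_cents`, the cent scale, and the convention that unvoiced frames are marked with 0 Hz are all assumptions for illustration, not details taken from the paper.

```python
import math


def mean_f0_error_cents(f0_target_hz, f0_measured_hz):
    """Mean absolute f0 error in cents over frames voiced in both tracks.

    Hypothetical helper: unvoiced frames are assumed to be marked with 0 Hz
    and are skipped. Returns NaN when no frame is voiced in both tracks.
    """
    if len(f0_target_hz) != len(f0_measured_hz):
        raise ValueError("f0 tracks must have the same number of frames")

    # 1200 * log2(ratio) converts a frequency ratio to cents (100 per semitone).
    errors = [
        abs(1200.0 * math.log2(measured / target))
        for target, measured in zip(f0_target_hz, f0_measured_hz)
        if target > 0 and measured > 0
    ]
    if not errors:
        return math.nan
    return sum(errors) / len(errors)
```

Under these assumptions, a low error would indicate that the decoder follows the requested f0 rather than f0 information leaking through the latent code, which is the behaviour the bottleneck-size study looks for.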
Contributor : Axel Roebel
Submitted on : Sunday, March 6, 2022 - 4:57:38 PM
Last modification on : Tuesday, March 15, 2022 - 3:22:44 AM

Frederik Bous, Axel Roebel. A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice. Information, MDPI, 2022, 13 (3), pp.102. ⟨10.3390/info13030102⟩. ⟨hal-03599085⟩


