A voice conversion method based on joint pitch and spectral envelope transformation
Résumé
Most of the research in Voice Conversion (VC) is devoted to spectral transformation while the conversion of prosodic features is essentially obtained through a simple linear transformation of pitch. These separate transformations lead to an unsatisfactory speech conversion quality, especially when the speaking styles of the source and target speakers are different. In this paper, we propose a method capable of jointly converting pitch and spectral envelope information. The parameters to be transformed are obtained by combining scaled pitch values with the spectral envelope parameters for the voiced frames and only spectral envelope parameters for the unvoiced ones. These parameters are clustered using a Gaussian Mixture Model (GMM). Then the transformation functions are determined using a conditional expectation estimator. Tests carried out show that, this process leads to a satisfactory pitch transformation. Moreover, it makes the spectral envelope transformation more robust.