Synthesis and expressive transformation of singing voice

Luc Ardaillon

Abstract

State-of-the-art singing voice synthesis systems can already synthesize voices of reasonable quality, good enough for use in musical productions. Much effort is still needed, however, to reach a quality similar to that of a real professional singer. This thesis conducted research on the synthesis and expressive transformation of the singing voice, towards the development of a high-quality synthesizer that can automatically generate a natural and expressive singing voice from a given score and lyrics. Because the voice signal is highly variable, in both its control and its timbre, this involves considering various aspects. Three main research directions can be identified: the methods for modeling the voice signal to automatically generate an intelligible and natural-sounding voice from the given lyrics; the control of the synthesis to render an adequate interpretation of a given score while conveying some expressivity related to a specific singing style; and the transformation of the voice signal to improve its naturalness and add expressivity by varying the timbre appropriately according to the pitch, intensity and voice quality.

This thesis provides contributions in each of these three directions. First, a fully functional synthesis system has been developed, based on diphone concatenation, which we assume to be, up to now, the approach capable of providing the highest sound quality. The modular architecture of this system makes it possible to integrate and compare different signal modeling approaches. Then, the question of control is addressed, encompassing the automatic generation of the f0, intensity, and phoneme durations. A particular limitation of state-of-the-art approaches is the lack of controls provided to the composer to shape the expression of the synthesized voice. To tackle this issue, an important contribution of this thesis has been the development of a new parametric f0 model with intuitive controls. The modeling of specific singing styles has also been addressed by learning the expressive variations of the modeled control parameters from commercial recordings of famous singers and applying them to the synthesis of new scores. Finally, some investigations on expressive timbre transformations have been conducted, with a view to future integration into our synthesizer. These mainly concern methods related to intensity transformation, considering the effects of both the glottal source and the vocal tract, and the modeling of vocal roughness.

