C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter, The IRESTE On/Off (IRONOFF) dual handwriting database, Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), pp.455-458, 1999.

I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, MIT Press, 2016.

D. P. Kingma and M. Welling, Auto-encoding variational Bayes, 2013.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems, pp.2672-2680, 2014.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.

I. Sutskever, J. Martens, and G. E. Hinton, Generating text with recurrent neural networks, Proceedings of the 28th International Conference on Machine Learning, pp.1017-1024, 2011.

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, Proceedings of the 27th International Conference on Neural Information Processing Systems, vol.2, pp.3104-3112, 2014.

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3128-3137, 2015.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3156-3164, 2015.

J.-P. Briot and F. Pachet, Music generation by deep learning: Challenges and directions, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01660753

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, 2016.

C. M. Bishop, Mixture density networks, 1994.

A. Graves, Generating sequences with recurrent neural networks, CoRR, 2013.

U.-V. Marti and H. Bunke, A full English sentence database for off-line handwriting recognition, Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99), pp.705-708, 1999.

L. Theis and M. Bethge, Generative image modeling using spatial LSTMs, Proceedings of the 28th International Conference on Neural Information Processing Systems, vol.2, pp.1927-1935, 2015.

A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, Pixel recurrent neural networks, Proceedings of the 33rd International Conference on Machine Learning, vol.48, pp.1747-1756, 2016.

P. Koehn, Statistical Machine Translation, 2010.
URL : https://hal.archives-ouvertes.fr/hal-01433972

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: A method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp.311-318, 2002.

S. Banerjee and A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp.65-72, 2005.

R. Vedantam, C. L. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4566-4575, 2015.

H. Freeman, On the encoding of arbitrary geometric configurations, IRE Transactions on Electronic Computers, vol.EC-10, issue.2, pp.260-268, 1961.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin, vol.1, issue.6, pp.80-83, 1945.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, vol.9, pp.2579-2605, 2008.

F. Schroff, D. Kalenichenko, and J. Philbin, FaceNet: A unified embedding for face recognition and clustering, CoRR, 2015.

H. Bredin, TristouNet: Triplet loss for speaker turn embedding, CoRR, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01830421

R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton et al., Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron, CoRR, 2018.

Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg et al., Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis, CoRR, 2018.