A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, Singing voice separation with deep U-Net convolutional networks, Proc. of ISMIR (International Society for Music Information Retrieval), 2017.

P. Chandna, M. Miron, J. Janer, and E. Gómez, Monoaural audio source separation using deep convolutional neural networks, Proc. of LVA/ICA (International Conference on Latent Variable Analysis and Signal Separation), 2017.

A. Cohen-Hadria, A. Roebel, and G. Peeters, Improving singing voice separation using deep U-Net and Wave-U-Net with data augmentation, 2019.

H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin et al., Modulating early visual processing by language, Proc. of NIPS (Annual Conference on Neural Information Processing Systems), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01648683

V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. de Vries et al., Feature-wise transformations, Distill, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01841985

C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon et al., Onsets and frames: Dual-objective piano transcription, Proc. of ISMIR (International Society for Music Information Retrieval), 2018.

C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne et al., An improved relative self-attention mechanism for Transformer with application to music generation, 2018.

P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, Joint optimization of masks and deep recurrent neural networks for monaural source separation, IEEE/ACM TASLP (Transactions on Audio Speech and Language Processing), vol.23, issue.12, 2015.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proc. of ICML (International Conference on Machine Learning), 2015.

H. Kameoka, L. Li, S. Inoue, and S. Makino, Semi-blind source separation with multichannel variational autoencoder, 2018.

T. Kim, I. Song, and Y. Bengio, Dynamic layer normalization for adaptive neural acoustic modeling in speech recognition, CoRR, 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proc. of ICLR (International Conference on Learning Representations), 2015.

F. Mayer, D. Williamson, P. Mowlaee, and D. Wang, Impact of phase estimation on single-channel speech separation based on time-frequency masking, The Journal of the Acoustical Society of America, vol.141, pp.4668-4679, 2017.

E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, FiLM: Visual reasoning with a general conditioning layer, Proc. of AAAI (Conference on Artificial Intelligence), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01648685

C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto et al., mir_eval: A transparent implementation of common MIR metrics, Proc. of ISMIR (International Society for Music Information Retrieval), 2014.

Z. Rafii, A. Liutkus, F. Stöter, S. Mimilakis, D. Fitzgerald et al., An Overview of Lead and Accompaniment Separation in Music, IEEE/ACM TASLP (Transactions on Audio Speech and Language Processing), vol.26, issue.8, 2018.
URL : https://hal.archives-ouvertes.fr/lirmm-01766781

Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, 2017.

O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, Proc. of MICCAI (International Conference on Medical Image Computing and Computer Assisted Intervention), 2015.

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, Proc. of ICASSP (International Conference on Acoustics, Speech and Signal Processing), 2018.

D. Stoller, S. Ewert, and S. Dixon, Wave-U-Net: A multi-scale neural network for end-to-end audio source separation, Proc. of ISMIR (International Society for Music Information Retrieval), 2018.

F. Strub, M. Seurin, E. Perez, H. de Vries, J. Mary et al., Visual reasoning with multi-hop feature modulation, Proc. of ECCV (European Conference on Computer Vision), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01927811

Y. Tokozume, Y. Ushiku, and T. Harada, Between-class learning for image classification, Proc. of CVPR (Conference on Computer Vision and Pattern Recognition), 2018.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, 2016.

A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals et al., Parallel WaveNet: Fast high-fidelity speech synthesis, 2017.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE/ACM TASLP (Transactions on Audio Speech and Language Processing), vol.14, issue.4, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

L. Yang, S. Chou, and Y. Yang, MidiNet: A convolutional generative adversarial network for symbolic-domain music generation using 1D and 2D conditions, CoRR, 2017.