J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer normalization, 2016.

F. Bach and M. I. Jordan, Blind one-microphone speech separation: A spectral learning approach, Advances in Neural Information Processing Systems, 2005.

A. S. Bregman, Auditory Scene Analysis, 1990.

E. C. Cherry, Some experiments on the recognition of speech, with one and with two ears, The Journal of the Acoustical Society of America, 1953.

F. Chollet, Xception: Deep learning with depthwise separable convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, Language modeling with gated convolutional networks, Proceedings of the International Conference on Machine Learning, 2017.

A. Défossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, SING: Symbol-to-instrument neural generator, Advances in Neural Information Processing Systems, vol. 31, 2018.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010.

E. M. Grais, M. U. Sen, and H. Erdogan, Deep neural networks for single channel source separation, International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, Proceedings of the IEEE international conference on computer vision, 2015.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017.

A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, 2004.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, Single-channel multi-speaker separation using deep clustering, 2016.

A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar et al., Singing voice separation with deep U-Net convolutional networks, 2017.

T. Karras, T. Aila, S. Laine, and J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, 2017.

T. Karras, S. Laine, and T. Aila, A style-based generator architecture for generative adversarial networks, 2018.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.

M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, IEEE/ACM Transactions on Audio, Speech and Language Processing, 2017.

J.-Y. Liu and Y.-H. Yang, Denoising auto-encoder with recurrent skip connections and residual regression for music source separation, 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.

F. Lluís, J. Pons, and X. Serra, End-to-end music source separation: is it possible in the waveform domain?, 2018.

Y. Luo and N. Mesgarani, TasNet: time-domain audio separation network for real-time, single-channel speech separation, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.

Y. Luo and N. Mesgarani, Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.

E. Nachmani and L. Wolf, Unsupervised singing voice conversion, 2019.

A. A. Nugraha, A. Liutkus, and E. Vincent, Multichannel music separation with deep neural networks, Signal Processing Conference (EUSIPCO), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01334614

Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, 2017.

D. Rethage, J. Pons, and X. Serra, A WaveNet for speech denoising, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.

O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.

S. T. Roweis, One microphone source separation, Advances in Neural Information Processing Systems, 2001.

P. Smaragdis, C. Fevotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman, Static and dynamic source separation using nonnegative factorizations: A unified view, IEEE Signal Processing Magazine, vol.31, issue.3, 2014.

D. Stoller, S. Ewert, and S. Dixon, Wave-U-Net: A multi-scale neural network for end-to-end audio source separation, 2018.

F. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, Open-Unmix: a reference implementation for music source separation, Journal of Open Source Software, 2019.

F. Stöter, A. Liutkus, and N. Ito, The 2018 signal separation evaluation campaign, 14th International Conference on Latent Variable Analysis and Signal Separation, 2018.

N. Takahashi and Y. Mitsufuji, Multi-scale multi-band DenseNets for audio source separation, Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.

N. Takahashi, N. Goswami, and Y. Mitsufuji, MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation, 2018.

S. Uhlich, F. Giron, and Y. Mitsufuji, Deep neural network based instrument extraction from music, International Conference on Acoustics, Speech and Signal Processing, 2015.

S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp et al., Improving music source separation based on deep neural networks through data augmentation and network blending, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.

E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech and Language Processing, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00544230

D. Wang and G. J. Brown (Eds.), Computational Auditory Scene Analysis, 2006.

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss et al., Tacotron: Towards end-to-end speech synthesis, 2017.

Z. Wang, J. L. Roux, D. Wang, and J. R. Hershey, End-to-end speech separation with unfolded iterative phase reconstruction, 2018.

H. Zhang, Y. N. Dauphin, and T. Ma, Fixup initialization: Residual learning without normalization, 2019.