B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free? Weakly-supervised learning with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.685-694, 2015.
URL: https://hal.archives-ouvertes.fr/hal-01015140

M. Moreaux, N. Lyubova, I. Ferrané, and F. Lerasle, Mind the regularized GAP, for human action classification and semi-supervised localization based on visual saliency, 2018.
URL: https://hal.archives-ouvertes.fr/hal-01763103

B. Logan, Mel frequency cepstral coefficients for music modeling, International Symposium on Music Information Retrieval (ISMIR), 2000.

A. van den Oord, S. Dieleman, et al., WaveNet: A generative model for raw audio, 2016.

J. Lee, J. Park, K. L. Kim, and J. Nam, Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms, 2017.

J. Lee, J. Park, K. L. Kim, and J. Nam, SampleCNN: End-to-end deep convolutional neural networks using very small filters for music classification, Applied Sciences, vol.8, issue.1, p.150, 2018.

Y. Tokozume, Y. Ushiku, and T. Harada, Learning from between-class examples for deep sound recognition, 2017.

M. Lin, Q. Chen, and S. Yan, Network in network, 2013.
URL: https://hal.archives-ouvertes.fr/hal-00737767

A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, Conditional image generation with PixelCNN decoders, Advances in Neural Information Processing Systems, pp.4790-4798, 2016.

K. J. Piczak, ESC: Dataset for environmental sound classification, Proceedings of the 23rd ACM International Conference on Multimedia, pp.1015-1018, 2015.