.. .. Training/testing-settings,

.. .. Results,

D. .. Conclusion,

O. Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn et al., Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.10, pp.1533-1545, 2014.

J. Baker, The dragon system-an overview, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.23, issue.1, pp.24-29, 1975.

D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, Acoustic scene classification: Classifying environments from the sounds they produce, IEEE Signal Processing Magazine, vol.32, issue.3, pp.16-34, 2015.

F. Bastien, P. Lamblin, R. Pascanu, I. Goodfellow, J. Bergstra et al., Theano: new features and speed improvements, Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2012.

Y. Bengio, Learning deep architectures for ai, Found. Trends Mach. Learn, vol.2, issue.1, pp.1-127, 2009.

Y. Bengio, P. Frasconi, and P. Simard, The problem of learning long-term dependencies in recurrent networks, IEEE International Conference on Neural Networks, vol.3, pp.1183-1188, 1993.

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, NIPS, 2007.

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu et al., Theano: a CPU and GPU math expression compiler, Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

C. M. Bishop, Neural networks for pattern recognition, ch. The Multi-layer Perceptron, pp.116-161, 1995.

V. Bisot, R. Serizel, S. Essid, and G. Richard, Acoustic scene classification with matrix factorization for unsupervised feature learning, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6445-6449, 2016.

B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, Yaafe, an easy to use and efficient audio feature extraction software, proceedings of the 11th ISMIR conference, 2010.

A. Hervé, N. Bourlard, and . Morgan, The hybrid hmm/mlp approach, pp.155-183, 1994.

B. Mcfee, C. Raffel, D. Liang, P. W. Daniel, M. Ellis et al., librosa: Audio and Music Signal Analysis in Python, Proceedings of the 14th Python in Science Conference, pp.18-25, 2015.

S. Chu, S. Narayanan, C. C. Kuo, and M. J. Mataric, Where am i? scene recognition for mobile robots using audio features, IEEE International Conference on Multimedia and Expo, pp.885-888, 2006.

S. Chu, S. Narayanan, C. C. , and J. Kuo, Environmental sound recognition with time-frequency audio features, IEEE Trans. on Audio, Speech, and Language Processing, vol.17, issue.6, pp.1142-1158, 2009.

A. Li-chun-wang, T. Block, and F. , An industrial-strength audio search algorithm, Proceedings of the 4 th International Conference on Music Information Retrieval, 2003.

C. Clavel, L. Devillers, G. Richard, I. Vasilescu, and T. Ehrette, Detection and analysis of abnormal situations through fear-type acoustic manifestations, IEEE International Conference on Acoustics, Speech and Signal Processing -ICASSP '07, vol.4, pp.21-24, 2007.

C. Clavel, T. Ehrette, and G. Richard, Events detection for an audio-based surveillance system, IEEE International Conference on Multimedia and Expo, pp.1306-1309, 2005.

M. Cowling and R. Sitte, Comparison of techniques for environmental sound recognition, Pattern Recognition Letters, vol.24, issue.15, pp.2895-2907, 2003.

M. Crocco, M. Cristani, A. Trucco, and V. Murino, Audio surveillance: A systematic review, ACM Comput. Surv, vol.48, issue.4, p.46, 2016.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems, vol.2, issue.4, pp.303-314, 1989.

G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, Audio, Speech, and Language Processing, IEEE Transactions, vol.20, issue.1, pp.30-42, 2012.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the em algorithm, JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B, vol.39, issue.1, pp.1-38, 1977.

L. Deng and D. Yu, Deep learning: Methods and applications, Found. Trends Signal Process, vol.7, pp.197-387, 2014.

J. Dennis, H. Tran, and E. Chang, Image feature representation of the subband power distribution for robust sound event classification, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.2, pp.367-377, 2013.

A. Diment, E. Cakir, T. Heittola, and T. Virtanen, Automatic recognition of environmental sound events using all-pole group delay features, European Signal Processing Conference, pp.734-738, 2015.

A. Diment and T. Virtanen, Transfer learning of weakly labeled audio, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017.

A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund et al., Audio-based context recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, issue.1, pp.321-329, 2006.

S. Essid, M. Campedel, G. Richard, T. Piatrik, R. Benmokhtar et al., Machine learning techniques for multimedia analysis, pp.59-80, 2011.

S. Essid, S. Parekh, and N. Q. Duong, Alexey Ozerov, Fabio Antonacci, and Augusto Sarti, Multiview approaches to event detection and scene analysis, Romain Serizel, pp.243-276, 2018.

M. Fernández-delgado, E. Cernadas, S. Barro, and D. Amorim, Do we need hundreds of classifiers to solve real world classification ?, J. of Machine Learning Research, vol.15, pp.3133-3181, 2014.

A. Fischer and C. Igel, Progress in pattern recognition, image analysis, computer vision, and applications: 17th ibero american congress, pp.14-36, 2012.

K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, vol.36, issue.4, pp.193-202, 1980.

K. I. Funahashi, Multilayer neural networks and bayes decision theory, vol.11, pp.209-213, 1998.

F. Ganansia, V. Delcourt, Q. C. Pham, A. Lapeyronnie, C. Baudry et al., Audio-video surveillance system for public transportation, World Congress on Railway Research, 2011.

J. T. Geiger and K. Helwani, Improving event detection for audio surveillance using gabor filterbank features, European Signal Processing Conference, pp.719-723, 2015.

L. Gerosa, G. Valenzise, M. Tagliasacchi, F. Antonacci, and A. Sarti, Scream and Gunshot detection in noisy environments, European Signal Processing Conference, 2007.

F. Gers, J. Schmidhuber, A. , and F. Cummins, Learning to forget: Continual prediction with lstm, Neural Comput, vol.12, issue.10, pp.2451-2471, 2000.

F. Gers, N. Schraudolph, and J. Schmidhuber, Learning precise timing with lstm recurrent networks, J. Mach. Learn. Res, vol.3, pp.115-143, 2003.

Z. Ghahramani and M. I. Jordan, Supervised learning from incomplete data via an em approach, Advances in Neural Information Processing Systems, vol.6, pp.120-127, 1994.

H. Gish, A probabilistic approach to the understanding and training of neural network classifiers, International Conference on Acoustics, Speech, and Signal Processing, vol.3, pp.1361-1364, 1990.

A. Graves, A. R. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.6645-6649, 2013.

A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional lstm and other neural network architectures, Neural Networks, pp.5-6, 2005.

A. Graves, Sequence transduction with recurrent neural networks, International Conference of Machine Learning, 2012.

A. Graves and J. Schmidhuber, Offline handwriting recognition with multidimensional recurrent neural networks, Advances in Neural Information Processing Systems, vol.21, pp.545-552, 2009.

S. Grossberg, Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, i, Indiana Univ, Math. J, vol.19, pp.53-91, 1970.

H. He and E. A. Garcia, Learning from imbalanced data, IEEE Trans. on Knowl. and Data Eng, vol.21, issue.9, pp.1263-1284, 2009.

T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, Context-dependent sound event detection, EURASIP Journal on Audio, Speech, and Music Processing, issue.1, p.1, 2013.

P. Herrera, G. Peeters, and S. Dubnov, Automatic classification of musical instrument sounds, Journal of New Music Research, vol.32, 2003.

G. E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural Comput, vol.18, issue.7, pp.1527-1554, 2006.

G. E. Hinton and T. J. Sejnowski, Parallel distributed processing: Explorations in the microstructure of cognition, vol.1, pp.282-317, 1986.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput, vol.9, issue.8, pp.1735-1780, 1997.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

W. Huang, T. Chiew, H. Li, T. S. Kok, and J. Biswas, Scream detection for home applications, IEEE Conference on Industrial Electronics and Applications, pp.2115-2120, 2010.

A. G. Ivakhnenko, The group method of data handling: an rival of the method of stochastic approximation, Soviet Automatic Control, vol.13, issue.3, pp.43-55, 1968.

M. Janvier, X. Alameda-pineda, L. Girin, and R. Horaud, Sound-event recognition with a companion humanoid, IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp.104-111, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00768767

M. Janvier, X. Alameda-pineda, L. Girin, and R. Horaud, Sound Representation and Classification Benchmark for Domestic Robots, IEEE International Conference on Robotics and Automation, vol.2014, pp.6285-6292, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00952092

N. Japkowicz and S. Stephen, The class imbalance problem: A systematic study, Intell. Data Anal, vol.6, issue.5, pp.429-449, 2002.

F. Jelinek, Continuous speech recognition by statistical methods, Proceedings of the IEEE, vol.64, issue.4, pp.532-556, 1976.

K. Kim and H. Ko, Hierarchical approach for abnormal acoustic event classification in an elevator, 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp.89-94, 2011.

T. Komatsu, Y. Senda, and R. Kondo, Acoustic event detection based on non-negative matrix factorization with mixtures of local dictionaries and activation aggregation, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.2259-2263, 2016.

G. Lafay, M. Lagrange, M. Rossignol, E. Benetos, and A. Roebel, A morphological model for simulating acoustic scenes and its application to sound event detection, IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.24, issue.10, pp.1854-1864, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01111381

P. Laffitte, D. Sodoyer, C. Tatkeu, and L. Girin, Deep neural networks for automatic detection of screams and shouted speech in subway trains, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6460-6464, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01385272

M. Lagrange, G. Lafay, B. Defreville, and J. Aucouturier, The bag-of-frames approach: a not so sufficient model for urban soundscapes, Journal of the Acoustical Society of America, vol.138, issue.5, pp.487-492, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01082501

Y. Lecun and Y. Bengio, The handbook of brain theory and neural networks, ch. Convolutional Networks for Images, Speech, and Time Series, pp.255-258, 1998.

Y. Lee, D. Han, and H. Ko, Acoustic Signal Based Abnormal Event Detection in Indoor Environment using Multiclass Adaboost, IEEE Trans. on Consumer Electronics, vol.59, issue.3, pp.615-622, 2013.

B. Lei and M. Mak, Sound-event partitioning and feature normalization for robust sound-event detection, Int. Conf. on Digital Signal. Processing (Hong Kong), pp.389-394, 2014.

H. Lei and B. Sun, A study on the dynamic time warping in kernel machines, Third International IEEE Conference on Signal-Image Technologies and Internet-Based System, pp.839-845, 2007.

J. Li, W. Dai, F. Metze, S. Qu, and S. Das, A comparison of deep learning methods for environmental sound detection, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

R. P. Lippmann, Pattern classification using neural networks, IEEE Communications Magazine, vol.27, issue.11, pp.47-50, 1989.

R. P. Lippmann, From statistics to neural networks: Theory and pattern recognition applications, ch. Neural Networks, Bayesian a posteriori Probabilities, and Pattern Classification, pp.83-104, 1994.

T. Bruce and . Lowerre, The harpy speech recognition system, p.7619331, 1976.

R. Lu, Z. Duan, and C. Zhang, Metric learning based data augmentation for environmental sound classification, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017.

S. Warren, W. Mcculloch, and . Pitts, A logical calculus of the ideas immanent in nervous activity, The bulletin of mathematical biophysics, vol.5, issue.4, pp.115-133, 1943.

I. Mcloughlin, H. Zhang, Z. Xie, Y. Song, and W. Xiao, Robust sound event classification using deep neural networks, IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol.23, issue.3, pp.540-552, 2015.

A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, Acoustic event detection in real life recordings, 18th European Signal Processing Conference, pp.1267-1271, 2010.

A. Mesaros, T. Heittola, and T. Virtanen, Tut database for acoustic scene classification and sound event detection, 24th European Signal Processing Conference (EUSIPCO), pp.1128-1132, 2016.

A. Mesaros, T. Heittola, and T. Virtanen, Metrics for polyphonic sound event detection, Applied Sciences, vol.6, issue.6, p.162, 2016.

Y. Miao, Kaldi+pdnn: Building dnn-based ASR systems with kaldi and PDNN, 2014.

T. Mikolov, M. Karafiat, and L. Burget, Recurrent neural network based language model

S. Mun, S. Shon, W. Kim, D. K. Han, and H. Ko, Deep neural network based learning and transferring mid-level audio features for acoustic scene classification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.796-800, 2017.

K. Nakadai, D. Matsuura, H. G. Okuno, and H. Tsujino, Improvement of recognition of simultaneous speech signals using av integration and scattering theory for humanoid robots, Speech Communication, vol.44, issue.1, pp.97-112, 2004.

R. M. Neal, Connectionist learning of belief networks, vol.56, pp.71-113, 1992.

R. M. Neal and G. E. Hinton, Learning in graphical models, pp.335-368, 1998.

Y. Andrew, M. I. Ng, and . Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes, Advances in Neural Information Processing Systems, vol.14, pp.841-848, 2002.

K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, Audio-visual speech recognition using deep learning, Applied Intelligence, vol.42, issue.4, pp.722-737, 2015.

S. Ntalampiras, I. Potamitis, and N. Fakotakis, On acoustic surveillance of hazardous situations, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.165-168, 2009.

S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, vol.22, issue.10, pp.1345-1359, 2010.

G. Parascandolo, H. Huttunen, and T. Virtanen, Recurrent neural networks for polyphonic sound event detection in real life recordings, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. Mcadams, The timbre toolbox: Extracting audio descriptors from musical signals, The Journal of the Acoustical Society of America, vol.130, issue.5, pp.2902-2916, 2011.

G. Peeters and X. Rodet, Automatically selecting signal descriptors for sound classification, 2002.
URL : https://hal.archives-ouvertes.fr/hal-01161323

A. Pennisi, D. D. Bloisi, and L. Iocchi, Online real-time crowd behavior detection in video sequences, Computer Vision and Image Understanding, vol.144, pp.166-176, 2016.

H. Phan, M. Maas, R. Mazur, and A. Mertins, Random regression forests for acoustic event detection and classification, IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol.23, issue.1, pp.20-31, 2015.

J. Pohjalainen, P. Alku, and T. Kinnunen, Shout detection in noise, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp.4968-4971, 2011.

J. Pohjalainen, T. Raitio, and P. Alku, Detection of shouted speech in the presence of ambient noise, Interspeech, pp.2621-2624, 2011.

A. R. Mohamed, G. E. Dahl, and G. Hinton, Acoustic modeling using deep belief networks, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.1, pp.14-22, 2012.

A. R. Mohamed, G. Hinton, and G. Penn, Understanding how deep belief networks perform acoustic modelling, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4273-4276, 2012.

L. R. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition, Proceedings of the IEEE, vol.77, issue.2, pp.257-286, 1989.

F. Rosenblatt, The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review, pp.65-386, 1958.

J. Rouas, J. Louradour, and S. Ambellouis, Audio events detection in public transport vehicle, Intelligent Transportation Systems Conference, pp.733-738, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00664991

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol.323, pp.533-536, 1986.

, Parallel distributed processing: Explorations in the microstructure of cognition, vol.1, pp.318-362, 1986.

T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, Convolutional, long short-term memory, fully connected deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4580-4584, 2015.

H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.26, issue.1, pp.43-49, 1978.

J. Salamon and J. P. Bello, Feature learning with deep scattering for urban sound analysis, European Signal Processing Conference, pp.729-733, 2015.

V. Saligrama and Z. Chen, Video anomaly detection based on local statistical aggregates, IEEE Conference on Computer Vision and Pattern Recognition, pp.2112-2119, 2012.

J. Schmidhuber, Deep learning in neural networks: An overview, vol.61, pp.85-117, 2015.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, pp.1929-1958, 2014.

R. Stiefelhagen, K. Bernardin, R. Bowers, R. Rose, M. Michel et al., The clear 2007 evaluation, 2007.

B. L. Sturm and A. Lerch, An introduction to audio content analysis: Applications in signal processing and music informatics, Computer Music Journal, vol.37, issue.4, pp.90-91, 2013.

A. Temko, C. Nadeu, D. Macho, R. Malkin, C. Zieger et al., Computers in the human interaction loop, ch. Acoustic Event Detection and Classification, pp.61-73, 2009.

G. Valenzise, L. Gerosa, M. Tagliasacchi, E. Antonacci, and A. Sarti, Scream and gunshot detection and localization for audio-surveillance systems, IEEE Conference on Advanced Video and Signal Based Surveillance, pp.21-26, 2007.

X. Valero and F. Alías, Gammatone wavelet features for sound classification in surveillance applications, European Signal Processing Conference, pp.1658-1662, 2012.

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell et al., Sequence to sequence -video to text, IEEE International Conference on Computer Vision (ICCV), pp.4534-4542, 2015.

T. Virtanen, A. Mesaros, T. Heittola, M. Plumbley, P. Foster et al., Proceedings of the detection and classification of acoustic scenes and events 2016 workshop, 2016.

J. Wang, C. Lin, B. Chen, and M. Tsai, Gabor-based nonuniform scale-frequency map for environmental sound classification in home automation, IEEE Trans. on Automation Science and Engineering, vol.11, issue.2, pp.607-613, 2014.

Y. Wang and F. Metze, A first attempt at polyphonic sound event detection using connectionnist temporal classification, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2017.

D. Wei, J. Li, P. Pham, S. Das, and S. Qu, Acoustic scene recognition with deep neural networks (DCASE challenge, DCASE2016 Challenge, 2016.

P. Werbos, Backpropagation through time: what it does and how to do it, vol.78, pp.1550-1560, 1990.

X. Wu, H. Gong, P. Chen, Z. Zhong, and Y. Xu, Surveillance robot utilizing video and audio information, J. of Intelligent and Robotic Systems, vol.55, issue.4, pp.403-421, 2009.

Z. Xing, J. Pei, and E. Keogh, A brief survey on sequence classification, SIGKDD Explor. Newsl, vol.12, issue.1, pp.40-48, 2010.

G. Xiong, X. Wu, Y. L. Chen, and Y. Ou, Abnormal crowd behavior detection based on the energy model, IEEE International Conference on Information and Automation, pp.495-500, 2011.

S. J. Young and L. L. Chase, Speech recognition evaluation: a review of the u.s. csr and lvcsr programmes, Computer Speech and Language, vol.12, issue.4, pp.263-279, 1998.

G. P. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol.30, issue.4, pp.451-462, 2000.

X. Zhang and J. Wu, Deep belief networks based voice activity detection, IEEE Trans. on Audio, Speech, and Language Processing, vol.21, issue.4, pp.697-710, 2013.

X. Zhou, X. Zhuang, M. Liu, H. Tang, M. Hasegawa-johnson et al., revised selected papers, ch. HMM-Based Acoustic Event Detection with AdaBoost Feature Selection, pp.345-353, 2007.

R. Zouaoui, R. Audigier, S. Ambellouis, F. Capman, H. Benhadda et al., Embedded security system for multi-modal surveillance in a railway carriage, SPIE security and defence, vol.9652, pp.9652-9663, 2015.