J. Valin, J. Rouat, and F. Michaud, Microphone array postfilter for separation of simultaneous non-stationary sources, Proc. ICASSP, 2004.

D. Bechler, M. S. Schlosser, and K. Kroschel, System for robust 3D speaker tracking using microphone array measurements, 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), pp.2117-2122, 2004.
DOI : 10.1109/IROS.2004.1389722

D. B. Ward, E. A. Lehmann, and R. C. Williamson, Particle filtering algorithms for tracking an acoustic source in a reverberant environment, IEEE Transactions on Speech and Audio Processing, vol.11, issue.6, pp.826-836, 2003.
DOI : 10.1109/TSA.2003.818112

J. Valin, F. Michaud, B. Hadjou, and J. Rouat, Localization of simultaneous moving sound sources for mobile robot using a frequency-domain steered beamformer approach, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004, pp.1033-1038, 2004.
DOI : 10.1109/ROBOT.2004.1307286

Y. Ephraim and D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.32, issue.6, pp.1109-1121, 1984.
DOI : 10.1109/TASSP.1984.1164453

I. Cohen and B. Berdugo, Speech enhancement for non-stationary noise environments, Signal Processing, vol.81, issue.11, pp.2403-2418, 2001.
DOI : 10.1016/S0165-1684(01)00128-1

J. Huang, N. Ohnishi, X. Guo, and N. Sugie, Echo avoidance in a computational model of the precedence effect, Speech Communication, vol.27, issue.3-4, pp.223-233, 1999.

R. Duraiswami, D. Zotkin, and L. Davis, Active speech source localization by a dual coarse-to-fine search, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), pp.3309-3312, 2001.
DOI : 10.1109/ICASSP.2001.940366
URL : http://www.umiacs.umd.edu/~dz/pbpslist/icassp01bf.pdf

H. Asoh, F. Asano, K. Yamamoto, T. Yoshimura, Y. Motomura et al., An application of a particle filter to Bayesian multiple sound source tracking with audio and video information fusion, Proc. Fusion, pp.805-812, 2004.

A. Doucet, S. Godsill, and C. Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering, Statistics and Computing, vol.10, issue.3, pp.197-208, 2000.
DOI : 10.1023/A:1008935410038

S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, and T. Virtanen, Sound event detection in multichannel audio using spatial and harmonic features, Proc IEEE AASP Chall Detect Classif Acoust Scenes Events, 2016.

A. Amir, M. Berg, S. F. Chang, W. Hsu, G. Iyengar et al., IBM Research TRECVID-2003 video retrieval system, 2003.

G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, Deep canonical correlation analysis, Proc Int Conf Mach Learn, 2013.

F. Antonacci, D. Lonoce, M. Motta, A. Sarti, and S. Tubaro, Efficient Source Localization and Tracking in Reverberant Environments Using Microphone Arrays, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., p.1061, 2005.
DOI : 10.1109/ICASSP.2005.1416195

F. Antonacci, M. Matteucci, D. Migliore, D. Riva, A. Sarti et al., Tracking Multiple Acoustic Sources in Reverberant Environments using Regularized Particle Filter, 2007 15th International Conference on Digital Signal Processing, pp.99-102, 2007.
DOI : 10.1109/ICDSP.2007.4288528

T. Arai, H. Hodoshima, and K. Yasu, Using Steady-State Suppression to Improve Speech Intelligibility in Reverberant Environments for Elderly Listeners, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.7, pp.1775-1780, 2010.
DOI : 10.1109/TASL.2010.2052165

E. Argones Rúa, H. Bredin, C. García Mateo, G. Chollet, and D. González Jiménez, Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models, Pattern Anal Appl, vol.12, issue.3, pp.271-284, 2008.

M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing, vol.50, issue.2, pp.174-188, 2002.
DOI : 10.1109/78.978374

H. Asoh, F. Asano, T. Yoshimura, K. Yamamoto, Y. Motomura et al., An application of a particle filter to Bayesian multiple sound source tracking with audio and video information fusion, Proc Fusion, pp.805-812, 2004.

P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimedia Systems, vol.16, issue.6, pp.345-379, 2010.
DOI : 10.1007/s00530-010-0182-0
URL : http://www.comp.nus.edu.sg/%7Emohan/papers/fusion_survey.pdf

Z. Barzelay and Y. Y. Schechner, Harmony in Motion, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383344

A. Beck, P. Stoica, and J. Li, Exact and Approximate Solutions of Source Localization Problems, IEEE Transactions on Signal Processing, vol.56, issue.5, pp.1770-1778, 2008.
DOI : 10.1109/TSP.2007.909342

R. Benmokhtar and B. Huet, Neural network combining classifier based on Dempster-Shafer theory for semantic indexing in video content, Adv Multimed Model, pp.196-205, 2006.

N. Bertin, R. Badeau, and E. Vincent, Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.3, pp.538-549, 2010.
DOI : 10.1109/TASL.2010.2041381
URL : https://hal.archives-ouvertes.fr/inria-00557088

F. Bießmann, F. C. Meinecke, A. Gretton, A. Rauch, G. Rainer et al., Temporal kernel CCA and its application in multimodal neuronal data analysis, Machine Learning, vol.79, issue.1-2, pp.5-27, 2010.
DOI : 10.1007/s10994-009-5153-3

J. Bitzer and K. U. Simmer, Superdirective Microphone Arrays, Microphone Arrays, pp.19-38, 2001.
DOI : 10.1007/978-3-662-04619-7_2

J. Bitzer, K. U. Simmer, and K. D. Kammeyer, Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), pp.2965-2968, 1999.
DOI : 10.1109/ICASSP.1999.761385

C. Blandin, A. Ozerov, and E. Vincent, Multi-source TDOA estimation in reverberant audio using angular spectra and clustering, Signal Processing, vol.92, issue.8, pp.1950-1960, 2012.
DOI : 10.1016/j.sigpro.2011.09.032
URL : https://hal.archives-ouvertes.fr/inria-00576297

P. Bofill and M. Zibulevsky, Underdetermined blind source separation using sparse representations, Signal Processing, vol.81, issue.11, pp.2353-2362, 2001.
DOI : 10.1016/S0165-1684(01)00120-7
URL : http://iew3.technion.ac.il/~mcib/undetermICA.pdf

K. Bousmalis and L. P. Morency, Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition, Face and Gesture 2011, pp.746-752, 2011.
DOI : 10.1109/FG.2011.5771341

H. Bredin and G. Chollet, Measuring audio and visual speech synchrony: methods and applications, IET International Conference on Visual Information Engineering (VIE 2006), pp.255-260, 2006.
DOI : 10.1049/cp:20060538
URL : http://ieeexplore.ieee.org/iel5/4286642/4286643/04286698.pdf

A. Brutti, M. Omologo, and P. Svaizer, Localization of multiple speakers based on a two step acoustic map analysis, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4349-4352, 2008.
DOI : 10.1109/ICASSP.2008.4518618

A. Canclini, F. Antonacci, A. Sarti, and S. Tubaro, Acoustic Source Localization With Distributed Asynchronous Microphone Networks, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.2, pp.439-443, 2013.
DOI : 10.1109/TASL.2012.2215601

A. Canclini, P. Bestagini, F. Antonacci, M. Compagnoni, A. Sarti et al., A Robust and Low-Complexity Source Localization Algorithm for Asynchronous Distributed Microphone Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.10, pp.1563-1575, 2015.
DOI : 10.1109/TASLP.2015.2439040

J. Capon, High-resolution frequency-wavenumber spectrum analysis, Proceedings of the IEEE, vol.57, issue.8, pp.1408-1418, 1969.
DOI : 10.1109/PROC.1969.7278

G. C. Carter, Coherence and time delay estimation, Proceedings of the IEEE, vol.75, issue.2, pp.236-255, 1987.
DOI : 10.1109/PROC.1987.13723

A. Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval, Blind Audiovisual Source Separation Based on Sparse Redundant Representations, IEEE Transactions on Multimedia, vol.12, issue.5, pp.358-371, 2010.
DOI : 10.1109/TMM.2010.2050650
URL : https://hal.archives-ouvertes.fr/inria-00541412

A. L. Casanovas and P. Vandergheynst, Nonlinear video diffusion based on audio-video synchrony, IEEE Trans Multimed, 2010.

S. F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa et al., Large-scale multimodal semantic concept detection for consumer video, Proceedings of the international workshop on Workshop on multimedia information retrieval, MIR '07, pp.255-264, 2007.
DOI : 10.1145/1290082.1290118
URL : http://www.ee.columbia.edu/dvmm/publications/07/mir2007-kkalg.pdf

C. C. Chibelushi, J. S. Mason, and N. Deravi, Integrated person identification using voice and facial features, IEE Colloquium on Image Processing for Security Applications, pp.1-4, 1997.
DOI : 10.1049/ic:19970380

T. Choudhury, J. M. Rehg, V. Pavlovic, and A. Pentland, Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection, Object recognition supported by user interaction for service robots, pp.789-794, 2002.
DOI : 10.1109/ICPR.2002.1048137
URL : http://www.media.mit.edu/~tanzeem/tanzeem_icpr02.pdf

A. Cichocki, R. Zdunek, and S. Amari, Nonnegative Matrix and Tensor Factorization, IEEE Signal Process Mag, vol.25, issue.1, pp.142-145, 2008.
DOI : 10.1109/MSP.2008.4408452

M. Cobos, A. Marti, and J. J. Lopez, A Modified SRP-PHAT Functional for Robust Real-Time Sound Source Localization With Scalable Spatial Sampling, IEEE Signal Processing Letters, vol.18, issue.1, pp.71-74, 2011.
DOI : 10.1109/LSP.2010.2091502

M. Compagnoni, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, Localization of Acoustic Sources Through the Fitting of Propagation Cones Using Multiple Independent Arrays, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.7, pp.1964-1975, 2012.
DOI : 10.1109/TASL.2012.2191958

H. Cox, R. Zeskind, and T. Kooij, Practical supergain, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.34, issue.3, pp.393-398, 1986.
DOI : 10.1109/TASSP.1986.1164847

M. Cristani, M. Bicego, and V. Murino, Audio-Visual Event Recognition in Surveillance Video Sequences, IEEE Transactions on Multimedia, vol.9, issue.2, pp.257-267, 2007.
DOI : 10.1109/TMM.2006.886263

M. Crocco, A. Del Bue, and V. Murino, A Bilinear Approach to the Position Self-Calibration of Multiple Sensors, IEEE Transactions on Signal Processing, vol.60, issue.2, pp.660-673, 2012.
DOI : 10.1109/TSP.2011.2175387

R. Cutler and L. Davis, Look who's talking: speaker detection using video and audio correlation, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532), pp.1589-1592, 2000.
DOI : 10.1109/ICME.2000.871073

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.886-893, 2005.
DOI : 10.1109/CVPR.2005.177
URL : https://hal.archives-ouvertes.fr/inria-00548512

E. D'Arca, N. Robertson, and J. Hopgood, Look who's talking: Detecting the dominant speaker in a cluttered scenario, Proc IEEE Int Conf Acoust Speech Signal Process, 2014.

J. Dibiase, H. Silverman, and M. Brandstein, Robust Localization in Reverberant Rooms, Microphone Arrays, pp.157-180, 2001.
DOI : 10.1007/978-3-662-04619-7_8

J. Dmochowski, J. Benesty, and S. Affes, A Generalized Steered Response Power Method for Computationally Viable Source Localization, IEEE Transactions on Audio, Speech and Language Processing, vol.15, issue.8, pp.2510-2526, 2007.
DOI : 10.1109/TASL.2007.906694

H. Do, H. Silverman, and Y. Yu, A Real-Time SRP-PHAT Source Location Implementation using Stochastic Region Contraction (SRC) on a Large-Aperture Microphone Array, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07, pp.121-124, 2007.
DOI : 10.1109/ICASSP.2007.366631

S. Doclo and M. Moonen, GSVD-based optimal filtering for single and multimicrophone speech enhancement, IEEE Transactions on Signal Processing, vol.50, issue.9, pp.2230-2244, 2002.
DOI : 10.1109/TSP.2002.801937
URL : ftp://ftp.esat.kuleuven.ac.be/pub/SISTA/doclo/reports/01-30.ps.gz

N. Q. Duong, E. Vincent, and R. Gribonval, Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.7, pp.1830-1840, 2010.
DOI : 10.1109/TASL.2010.2050716
URL : https://hal.archives-ouvertes.fr/inria-00541865

N. Q. Duong, E. Vincent, and R. Gribonval, Spatial location priors for Gaussian model based reverberant audio source separation, EURASIP Journal on Advances in Signal Processing, vol.2013, issue.1, 2013.
DOI : 10.1186/1687-6180-2013-149
URL : https://hal.archives-ouvertes.fr/hal-00870191

G. W. Elko, Spatial Coherence Functions for Differential Microphones in Isotropic Noise Fields, Microphone Arrays: Signal Processing Techniques and Applications, pp.61-85, 2001.
DOI : 10.1007/978-3-662-04619-7_4

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional two-stream network fusion for video action recognition, arXiv preprint arXiv:1604.06573, 2016.
DOI : 10.1109/cvpr.2016.213

C. Févotte and J. F. Cardoso, Maximum likelihood approach for blind audio source separation using time-frequency Gaussian source models, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005., pp.78-81, 2005.
DOI : 10.1109/ASPAA.2005.1540173

J. W. Fisher III, T. Darrell, W. T. Freeman, and P. Viola, Learning Joint Statistical Models for Audio-Visual Fusion and Segregation, Proc Adv Neural Inf Process Syst, pp.772-778, 2001.

D. Fitzgerald, M. Cranitch, and E. Coyle, Extended Nonnegative Tensor Factorisation Models for Musical Sound Source Separation, Computational Intelligence and Neuroscience, vol.2008, 2008.
DOI : 10.1155/2008/872425

D. Fitzgerald, M. Cranitch, and E. Coyle, Using tensor factorisation models to separate drums from polyphonic music, Proc Int Conf Digit Audio Eff, 2009.

S. Foucher, F. Laliberté, G. Boulianne, and L. Gagnon, A Dempster-Shafer Based Fusion Approach for Audio-Visual Speech Recognition with Application to Large Vocabulary French Speech, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006.
DOI : 10.1109/ICASSP.2006.1660091

O. L. Frost, An algorithm for linearly constrained adaptive array processing, Proceedings of the IEEE, vol.60, issue.8, pp.926-935, 1972.
DOI : 10.1109/PROC.1972.8817

A. Gandhi, A. Sharma, A. Biswas, and O. Deshmukh, GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion, 2016.
DOI : 10.1109/ICCV.2015.512

T. Gehrig, K. Nickel, H. Ekenel, U. Klee, and J. McDonough, Kalman filters for audio-video source localization, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005., pp.118-121, 2005.
DOI : 10.1109/ASPAA.2005.1540183
URL : http://www.gehrignet.de/media/pdf/waspaa-October2005.pdf

R. Goecke and J. B. Millar, Statistical Analysis of the Relationship between Audio and Video Speech Parameters for Australian English, Proc ISCA Tutor Res Workshop Audit-Vis Speech Process, pp.133-138, 2003.

J. N. Gowdy, A. Subramanya, C. Bartels, and J. A. Bilmes, DBN based multi-stream models for audio-visual speech recognition, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004.
DOI : 10.1109/ICASSP.2004.1326155
URL : http://ssli.ee.washington.edu/people/bilmes/mypapers/dbn_icassp04.pdf

G. Gravier, G. Potamianos, and C. Neti, Asynchrony modeling for audio-visual speech recognition, Proceedings of the second international conference on Human Language Technology Research, pp.1-6, 2002.
DOI : 10.3115/1289189.1289244
URL : http://www.research.ibm.com/AVSTG/HLT02_ASYNCHRONY.pdf

R. Gribonval and M. Zibulevsky, Sparse component analysis, Handbook of Blind Source Separation, pp.367-420, 2010.
DOI : 10.1016/B978-0-12-374726-6.00015-1
URL : https://hal.archives-ouvertes.fr/inria-00541853

L. Griffiths and C. Jim, An alternative approach to linearly constrained adaptive beamforming, IEEE Transactions on Antennas and Propagation, vol.30, issue.1, pp.27-34, 1982.
DOI : 10.1109/TAP.1982.1142739

T. Gustafsson, B. D. Rao, and M. Trivedi, Source localization in reverberant environments: modeling and statistical analysis, IEEE Transactions on Speech and Audio Processing, vol.11, issue.6, pp.791-803, 2003.
DOI : 10.1109/TSA.2003.818027
URL : http://www.itr-rescue.org/pubs/upload/335_Gustafsson,2005.pdf

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Computation, vol.16, issue.12, pp.2639-2664, 2004.
DOI : 10.1162/0899766042321814
URL : http://eprints.ecs.soton.ac.uk/9225/01/tech_report03.pdf

S. Haykin, Adaptive Filter Theory, 5th edn, 2014.

S. Haykin, J. H. Justice, N. L. Owsley, J. Yen, and A. C. Kak, Array signal processing, 1985.

H. Hotelling, Relations between two sets of variates, Biometrika, vol.28, issue.3-4, pp.321-377, 1936.

D. Hu, X. Li, and X. Lu, Temporal Multimodal Learning in Audiovisual Speech Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.389

P. S. Huang, X. Zhuang, and M. Hasegawa-Johnson, Improving acoustic event detection using generalizable visual features and multi-modality modeling, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.349-352, 2011.
DOI : 10.1109/ICASSP.2011.5946412
URL : http://www.ifp.illinois.edu/speech/pubs/2011/huang11icassp.pdf

Y. Huang, J. Benesty, G. Elko, and R. Mersereau, Real-time passive source localization: a practical linear-correction least-squares approach, IEEE Transactions on Speech and Audio Processing, vol.9, issue.8, pp.943-956, 2001.
DOI : 10.1109/89.966097

Y. Ivanov, T. Serre, and J. Bouvrie, Error weighted classifier combination for multi-modal human identification, 2005.

H. Izadinia, I. Saleemi, and M. Shah, Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects, IEEE Transactions on Multimedia, vol.15, issue.2, pp.378-390, 2013.
DOI : 10.1109/TMM.2012.2228476

Y. Izumi, N. Ono, and S. Sagayama, Sparseness-Based 2CH BSS using the EM Algorithm in Reverberant Environment, 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp.147-150, 2007.
DOI : 10.1109/ASPAA.2007.4393015

X. Jaureguiberry, E. Vincent, and G. Richard, Fusion Methods for Speech Enhancement and Audio Source Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, issue.7, pp.1266-1279, 2016.
DOI : 10.1109/TASLP.2016.2553441
URL : https://hal.archives-ouvertes.fr/hal-01120685

I. H. Jhuo, G. Ye, S. Gao, D. Liu, Y. G. Jiang et al., Discovering joint audio-visual codewords for video event detection, Machine Vision and Applications, vol.25, issue.1, pp.33-47, 2014.
DOI : 10.1145/2324796.2324843

W. Jiang, C. Cotton, S. F. Chang, D. Ellis, and A. Loui, Short-term audiovisual atoms for generic video concept classification, Proc ACM Int Conf Multimed, pp.5-14, 2009.
DOI : 10.1145/1631272.1631277
URL : http://labrosa.ee.columbia.edu/~dpwe/pubs/JiangCCEL09-ST-AVA.pdf

W. Jiang and A. C. Loui, Audio-visual grouplet, Proceedings of the 19th ACM international conference on Multimedia, MM '11, pp.123-132, 2011.
DOI : 10.1145/2072298.2072316

Y. G. Jiang, S. Bhattacharya, S. F. Chang, and M. Shah, High-level event recognition in unconstrained videos, International Journal of Multimedia Information Retrieval, vol.2, issue.2, pp.73-101, 2013.
DOI : 10.1007/s13735-012-0024-2

Y. G. Jiang, X. Zeng, and G. Ye, Columbia-UCF TRECVID2010 multimedia event detection: Combining multiple modalities, contextual concepts, and temporal matching, Proc NIST TRECVID Workshop, 2010.

C. Joder, S. Essid, and G. Richard, Temporal Integration for Audio Classification With Application to Musical Instrument Classification, IEEE Transactions on Audio, Speech, and Language Processing, vol.17, issue.1, 2008.
DOI : 10.1109/TASL.2008.2007613
URL : http://perso.telecom-paristech.fr/~grichard/Publications/TSALP_joder08.pdf

A. Jourjine, S. Rickard, and Ö. Yılmaz, Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), pp.2985-2988, 2000.
DOI : 10.1109/ICASSP.2000.861162

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp.1725-1732, 2014.
DOI : 10.1109/CVPR.2014.223
URL : http://www.cs.cmu.edu/~rahuls/pub/cvpr2014-deepvideo-rahuls.pdf

J. Kay, Feature discovery under contextual supervision using mutual information, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, pp.79-84, 1992.
DOI : 10.1109/IJCNN.1992.227286

E. Kidron, Y. Schechner, and M. Elad, Pixels that Sound, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp.88-95, 2005.
DOI : 10.1109/CVPR.2005.274

E. Kijak, G. Gravier, P. Gros, L. Oisel, and F. Bimbot, HMM based structuring of tennis videos using visual and audio cues, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), pp.309-312, 2003.
DOI : 10.1109/ICME.2003.1221310

J. Kittler, M. Hatef, R. P. Duin, and J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, issue.3, pp.226-239, 1998.
DOI : 10.1109/34.667881

U. Klee, T. Gehrig, and J. McDonough, Kalman Filters for Time Delay of Arrival-Based Source Localization, EURASIP Journal on Advances in Signal Processing, vol.2006, pp.1-15, 2006.
DOI : 10.1155/ASP/2006/12378

C. Knapp and G. Carter, The generalized correlation method for estimation of time delay, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.24, issue.4, pp.320-327, 1976.
DOI : 10.1109/TASSP.1976.1162830

T. G. Kolda and B. W. Bader, Tensor Decompositions and Applications, SIAM Review, vol.51, issue.3, pp.455-500, 2009.
DOI : 10.1137/07070111X

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Proc Adv Neural Inf Process Syst, pp.1097-1105, 2012.
DOI : 10.1145/3065386
URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

G. F. Kuhn, Model for the interaural time differences in the azimuthal plane, The Journal of the Acoustical Society of America, vol.62, issue.1, pp.157-167, 1977.
DOI : 10.1121/1.381498

P. L. Lai and C. Fyfe, Kernel and Nonlinear Canonical Correlation Analysis, International Journal of Neural Systems, vol.10, issue.5, pp.365-377, 2000.
DOI : 10.1142/S012906570000034X

A. Levy, S. Gannot, and E. Habets, Multiple-Hypothesis Extended Particle Filter for Acoustic Source Localization in Reverberant Environments, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.6, pp.1540-1555, 2011.
DOI : 10.1109/TASL.2010.2093517

D. Li, N. Dimitrova, M. Li, and I. Sethi, Multimedia content processing through cross-modal association, Proceedings of the eleventh ACM international conference on Multimedia, MULTIMEDIA '03, 2003.
DOI : 10.1145/957013.957143

A. Lim, K. Nakamura, K. Nakadai, T. Ogata, and H. G. Okuno, Audio-visual musical instrument recognition, 2011.

Q. Liu, W. Wang, P. J. Jackson, M. Barnard, J. Kittler et al., Source Separation of Convolutive and Noisy Mixtures Using Audio-Visual Dictionary Learning and Probabilistic Time-Frequency Masking, IEEE Transactions on Signal Processing, vol.61, issue.22, pp.5520-5535, 2013.
DOI : 10.1109/TSP.2013.2277834

A. Liutkus, J. L. Durrieu, L. Daudet, and G. Richard, An overview of informed audio source separation, 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp.1-4, 2013.
DOI : 10.1109/WIAMIS.2013.6616139
URL : https://hal.archives-ouvertes.fr/hal-00958661

D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60, issue.2, pp.91-110, 2004.
DOI : 10.1023/B:VISI.0000029664.99615.94
URL : http://www.cs.ubc.ca/~lowe/papers/ijcv03.ps

V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, Anomaly detection in crowded scenes, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.1975-1981, 2010.
DOI : 10.1109/CVPR.2010.5539872
URL : http://www.svcl.ucsd.edu/publications/conference/2010/cvpr2010/anomaly.pdf

S. Makino, T. W. Lee, and H. Sawada, Blind Speech Separation, 2007.
DOI : 10.1007/978-1-4020-6479-1

M. Mandel, S. Bressler, B. Shinn-Cunningham, and D. Ellis, Evaluating Source Separation Algorithms With Reverberant Speech, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.7, pp.1872-1883, 2010.
DOI : 10.1109/TASL.2010.2052252

M. Mandel and D. Ellis, EM Localization and Separation using Interaural Level and Phase Cues, 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp.275-278, 2007.
DOI : 10.1109/ASPAA.2007.4392987
URL : http://www.ee.columbia.edu/ln/labrosa/proceeds/waspaa/2007/paper/0026.pdf

P. Maragos, P. Gros, A. Katsamanis, and G. Papandreou, Cross-Modal Integration for Performance Improving in Multimedia: A Review, Multimodal processing and interaction, pp.1-46, 2008.
DOI : 10.1007/978-0-387-76316-3_1

A. Marti, M. Cobos, J. Lopez, and J. Escolano, A steered response power iterative method for high-accuracy acoustic source localization, The Journal of the Acoustical Society of America, vol.134, issue.4, pp.2627-2630, 2013.
DOI : 10.1121/1.4820885

A. Metallinou, S. Lee, and S. Narayanan, Decision level combination of multiple modalities for recognition and analysis of emotional expression, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2462-2465, 2010.
DOI : 10.1109/ICASSP.2010.5494890

S. Milani, M. Fontani, P. Bestagini, M. Barni, A. Piva et al., An overview on video forensics, APSIPA Transactions on Signal and Information Processing, vol.1, 2012.
DOI : 10.1017/atsip.2012.2
URL : https://doi.org/10.1017/atsip.2012.2

G. Monaci, P. Jost, P. Vandergheynst, B. Mailhé, S. Lesage et al., Learning Multimodal Dictionaries, IEEE Transactions on Image Processing, vol.16, issue.9, pp.2272-2283, 2007.
DOI : 10.1109/TIP.2007.901813
URL : https://hal.archives-ouvertes.fr/inria-00544772

G. Monaci and P. Vandergheynst, Audiovisual Gestalts, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), pp.200-200, 2006.
DOI : 10.1109/CVPRW.2006.34

G. Monaci, P. Vandergheynst, and F. T. Sommer, Learning Bimodal Structure in Audio-Visual Data, IEEE Transactions on Neural Networks, vol.20, issue.12, pp.1898-1910, 2009.
DOI : 10.1109/TNN.2009.2032182
URL : https://infoscience.epfl.ch/record/125304/files/IEEETNN_final.pdf

B. C. Moore, Introduction to the psychology of hearing, 1977.

K. P. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning, 2002.

M. R. Naphade, A. Garg, and T. S. Huang, Audio-visual event detection using duration dependent input output Markov models, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001), pp.39-43, 2001.
DOI : 10.1109/IVL.2001.990854

A. V. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao et al., A coupled HMM for audiovisual speech recognition, Proc IEEE Int Conf Acoust Speech Signal Process, 2002.
DOI : 10.1109/icassp.2002.1006167
URL : http://www.cs.ubc.ca/~murphyk/Papers/icassp02.pdf

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee et al., Multimodal deep learning, Proc Int Conf Mach Learn, pp.689-696, 2011.

V. T. Nguyen, D. L. Nguyen, M. T. Tran, D. D. Le, D. A. Duong et al., Query-adaptive late fusion with neural network for instance search, 2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP), pp.1-6, 2015.
DOI : 10.1109/MMSP.2015.7340795

J. Nikunen and T. Virtanen, Direction of Arrival Based Spatial Covariance Model for Blind Sound Source Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.3, pp.727-739, 2014.
DOI : 10.1109/TASLP.2014.2303576

M. Omologo and P. Svaizer, Acoustic event localization using a crosspower-spectrum phase based technique, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing, 1994.
DOI : 10.1109/ICASSP.1994.389667

T. Otsuka, K. Ishiguro, H. Sawada, and H. G. Okuno, Bayesian Nonparametrics for Microphone Array Processing, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.2, pp.493-504, 2014.
DOI : 10.1109/TASLP.2013.2294582

A. Ozerov and C. Févotte, Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.3, pp.550-563, 2010.
DOI : 10.1109/TASL.2009.2031510

A. Ozerov, C. Févotte, R. Blouet, and J. L. Durrieu, Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
DOI : 10.1109/ICASSP.2011.5946389
URL : https://hal.archives-ouvertes.fr/inria-00564851

A. Ozerov, E. Vincent, and F. Bimbot, A General Flexible Framework for the Handling of Prior Information in Audio Source Separation, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.4, pp.1118-1133, 2012.
DOI : 10.1109/TASL.2011.2172425
URL : https://hal.archives-ouvertes.fr/hal-00626962

S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez et al., Motion informed audio source separation, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
DOI : 10.1109/ICASSP.2017.7951787
URL : https://hal.archives-ouvertes.fr/hal-01447977

R. Parisi, P. Croene, and A. Uncini, Particle swarm localization of acoustic sources in the presence of reverberation, 2006 IEEE International Symposium on Circuits and Systems, p.4, 2006.
DOI : 10.1109/ISCAS.2006.1693689

L. Parra and C. Spence, Convolutive blind separation of non-stationary sources, IEEE Transactions on Speech and Audio Processing, vol.8, issue.3, pp.320-327, 2000.
DOI : 10.1109/89.841214

P. Pertilä, M. Mieskolainen, and M. Hämäläinen, Closed-form self-localization of asynchronous microphone arrays, 2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays, pp.139-144, 2011.
DOI : 10.1109/HSCMA.2011.5942380

P. Stoica and R. Moses, Spectral analysis of signals, Prentice Hall, Upper Saddle River, NJ, 2005.

A. Rocha, W. Scheirer, T. Boult, and S. Goldenstein, Vision of the unseen, ACM Computing Surveys, vol.43, issue.4, p.26, 2011.
DOI : 10.1145/1978802.1978805

L. Rokach, Ensemble-based classifiers, Artificial Intelligence Review, vol.33, issue.1-2, pp.1-39, 2010.
DOI : 10.1007/s10462-009-9124-7

R. Roy and T. Kailath, ESPRIT-estimation of signal parameters via rotational invariance techniques, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.37, issue.7, pp.984-995, 1989.
DOI : 10.1109/29.32276

D. A. Sadlier and N. E. O'Connor, Event detection in field sports video using audio-visual features and a support vector machine, IEEE Transactions on Circuits and Systems for Video Technology, vol.15, issue.10, pp.1225-1233, 2005.
DOI : 10.1109/TCSVT.2005.854237

H. Sawada, R. Mukai, S. Araki, and S. Makino, A Robust and Precise Method for Solving the Permutation Problem of Frequency-Domain Blind Source Separation, IEEE Transactions on Speech and Audio Processing, vol.12, issue.5, pp.530-538, 2004.
DOI : 10.1109/TSA.2004.832994

H. Schau and A. Robinson, Passive source localization employing intersecting spherical surfaces from time-of-arrival differences, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.35, issue.8, pp.1223-1225, 1987.
DOI : 10.1109/TASSP.1987.1165266

J. Scheuing and B. Yang, Disambiguation of TDOA Estimation for Multiple Sources in Reverberant Environments, IEEE Transactions on Audio, Speech, and Language Processing, vol.16, issue.8, pp.1479-1489, 2008.
DOI : 10.1109/TASL.2008.2004533

R. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation, vol.34, issue.3, pp.276-280, 1986.
DOI : 10.1109/TAP.1986.1143830

F. Sedighin, M. Babaie-zadeh, B. Rivet, and C. Jutten, Two multimodal approaches for single microphone source separation, 2016 24th European Signal Processing Conference (EUSIPCO), 2016.
DOI : 10.1109/EUSIPCO.2016.7760220
URL : https://hal.archives-ouvertes.fr/hal-01400542

N. Seichepine, S. Essid, C. Févotte, and O. Cappé, Soft nonnegative matrix co-factorization with application to multimodal speaker diarization, Proc IEEE Int Conf Acoust Speech Signal Process, 2013.
DOI : 10.1109/ICASSP.2013.6638316

N. Seichepine, S. Essid, C. Févotte, and O. Cappé, Soft Nonnegative Matrix Co-Factorization, IEEE Transactions on Signal Processing, vol.62, issue.22, p.99, 2014.
DOI : 10.1109/TSP.2014.2360141
URL : https://hal.archives-ouvertes.fr/hal-01116863

R. Serizel, V. Bisot, S. Essid, and G. Richard, Machine listening techniques as a complement to video image analysis in forensics, 2016 IEEE International Conference on Image Processing (ICIP), pp.948-952, 2016.
DOI : 10.1109/ICIP.2016.7532497
URL : https://hal.archives-ouvertes.fr/hal-01393959

R. Serizel, M. Moonen, B. van Dijk, and J. Wouters, Low-rank Approximation Based Multichannel Wiener Filter Algorithms for Noise Reduction with Application in Cochlear Implants, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.4, pp.785-799, 2014.
DOI : 10.1109/TASLP.2014.2304240
URL : https://hal.archives-ouvertes.fr/hal-01390918

R. Showen, R. Calhoun, and J. Dunham, Acoustic location of gunshots using combined angle of arrival and time of arrival measurements, p.589, 2009.

C. Sigg, B. Fischer, B. Ommer, V. Roth, and J. Buhmann, Nonnegative CCA for Audiovisual Source Separation, 2007 IEEE Workshop on Machine Learning for Signal Processing, pp.253-258, 2007.
DOI : 10.1109/MLSP.2007.4414315
URL : http://hci.iwr.uni-heidelberg.de/people/bommer/papers/0_nonnegative_cca.pdf

P. Smaragdis and M. Casey, Audio Visual Independent Components, Proc Int Symp Indep Compon Anal Blind Signal Sep, pp.709-714, 2003.

N. Srivastava and R. R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, Proc Adv Neural Inf Process Syst, pp.2222-2230, 2012.

N. Strobel, S. Spors, and R. Rabenstein, Joint audio-video object localization and tracking, IEEE Signal Processing Magazine, vol.18, issue.1, pp.22-31, 2001.
DOI : 10.1109/79.911196

Y. Tian, Z. Chen, and F. Yin, Distributed Kalman filter-based speaker tracking in microphone array networks, Applied Acoustics, vol.89, pp.71-77, 2015.
DOI : 10.1016/j.apacoust.2014.09.004

M. Togami and K. Hori, Multichannel semi-blind source separation via local Gaussian modeling for acoustic echo reduction, Proc Eur Signal Process Conf, 2011.

M. Togami and Y. Kawaguchi, Simultaneous Optimization of Acoustic Echo Reduction, Speech Dereverberation, and Noise Reduction against Mutual Interference, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue.11, pp.1612-1623, 2014.
DOI : 10.1109/TASLP.2014.2341918

V. Trifa, A. Koene, J. Moren, and G. Cheng, Real-time acoustic source localization in noisy environments for human-robot multimodal interaction, RO-MAN 2007, The 16th IEEE International Symposium on Robot and Human Interactive Communication, 2007.
DOI : 10.1109/ROMAN.2007.4415116

S. Valente, M. Tagliasacchi, F. Antonacci, P. Bestagini, A. Sarti et al., Geometric calibration of distributed microphone arrays from acoustic source correspondences, 2010 IEEE International Workshop on Multimedia Signal Processing, pp.13-18, 2010.
DOI : 10.1109/MMSP.2010.5661986

J. Valin, F. Michaud, and J. Rouat, Robust 3D Localization and Tracking of Sound Sources Using Beamforming and Particle Filtering, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006.
DOI : 10.1109/ICASSP.2006.1661100
URL : http://www.gel.usherb.ca/laborius/papers/ICASSP2006.pdf

A. Velivelli, C. W. Ngo, and T. S. Huang, Detection of documentary scene changes by audiovisual fusion, Proc Int Conf Image Video Retr, pp.227-238, 2003.

E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot, From Blind to Guided Audio Source Separation: How models and side information can improve the separation of sound, IEEE Signal Processing Magazine, vol.31, issue.3, pp.107-115, 2014.
DOI : 10.1109/MSP.2013.2297440
URL : https://hal.archives-ouvertes.fr/hal-00922378

L. Vuegen, B. V. Broeck, P. Karsmakers, H. V. Hamme, and B. Vanrumste, Automatic monitoring of activities of daily living based on real-life acoustic sensor data: a preliminary study, Proc Int Workshop Speech Lang Process Assist Technol, pp.113-118, 2013.

D. L. Wang, Time-Frequency Masking for Speech Separation and Its Potential for Hearing Aid Design, Trends in Amplification, vol.12, issue.4, pp.332-353, 2008.
DOI : 10.1177/1084713808326455
URL : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111459/pdf

H. Wang and P. Chu, Voice source localization for automatic camera pointing system in videoconferencing, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997.
DOI : 10.1109/ICASSP.1997.599595

H. Wang, A. Kläser, C. Schmid, and C. L. Liu, Dense Trajectories and Motion Boundary Descriptors for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, pp.60-79, 2013.
DOI : 10.1007/s11263-012-0594-8
URL : https://hal.archives-ouvertes.fr/hal-00725627

Y. Wu, C. Y. Lin, E. Y. Chang, and J. R. Smith, Multimodal information fusion for video concept detection, Proc IEEE Int Conf Image Process, pp.2391-2394, 2004.

Z. Wu, Y. G. Jiang, J. Wang, J. Pu, and X. Xue, Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, Proceedings of the ACM International Conference on Multimedia, MM '14, pp.167-176, 2014.
DOI : 10.1145/2647868.2654931

K. Yilmaz and A. T. Cemgil, Probabilistic latent tensor factorisation, Proc Int Conf Latent Var Anal Signal Sep, pp.346-353, 2010.

N. Yokoya, T. Yairi, and A. Iwasaki, Coupled Nonnegative Matrix Factorization Unmixing for Hyperspectral and Multispectral Data Fusion, IEEE Transactions on Geoscience and Remote Sensing, vol.50, issue.2, pp.528-537, 2012.
DOI : 10.1109/TGRS.2011.2161320

J. Yoo and S. Choi, Matrix co-factorization on compressed sensing, Proc Int Joint Conf Artif Intell, 2011.

W. A. Yost, Discriminations of interaural phase differences, The Journal of the Acoustical Society of America, vol.55, issue.6, pp.1299-1303, 1974.
DOI : 10.1121/1.1914701

B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, Integration of acoustic and visual speech signals using neural networks, IEEE Communications Magazine, vol.27, issue.11, pp.65-71, 1989.
DOI : 10.1109/35.41402

Q. Zhang, Z. Chen, and F. Yin, Distributed Marginalized Auxiliary Particle Filter for Speaker Tracking in Distributed Microphone Networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.24, issue.11, pp.1921-1934, 2016.
DOI : 10.1109/TASLP.2016.2590146

D. N. Zotkin and R. Duraiswami, Accelerated Speech Source Localization via a Hierarchical Search of Steered Response Power, IEEE Transactions on Speech and Audio Processing, vol.12, issue.5, pp.499-508, 2004.
DOI : 10.1109/TSA.2004.832990