J. Hershey and J. Movellan, Audio-vision: Using audio-visual synchrony to locate sounds, NIPS, 2000.

J. Fisher and T. Darrell, Speaker association with signal-level audiovisual fusion, IEEE Transactions on Multimedia, vol.6, issue.3, pp.406-413, 2004.
DOI : 10.1109/TMM.2004.827503
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.131.3704

T. Butz and J. Thiran, Feature space mutual information in speech-video sequences, Proceedings of the IEEE International Conference on Multimedia and Expo, 2002.
DOI : 10.1109/ICME.2002.1035605

M. J. Beal, H. Attias, and N. Jojic, Audio-Video Sensor Fusion with Probabilistic Graphical Models, ECCV, 2002.
DOI : 10.1007/3-540-47969-4_49
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6759

P. Besson, V. Popovici, J. Vesin, J. Thiran, and M. Kunt, Extraction of Audio Features Specific to Speech Production for Multimodal Speaker Detection, IEEE Transactions on Multimedia, vol.10, issue.1, pp.63-73, 2008.
DOI : 10.1109/TMM.2007.911302

Z. Barzelay and Y. Schechner, Onsets coincidence for crossmodal analysis, IEEE Transactions on Multimedia, vol.12, issue.2, pp.108-120, 2010.

A. Llagostera-Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval, Blind Audiovisual Source Separation Based on Sparse Redundant Representations, IEEE Transactions on Multimedia, vol.12, issue.5, pp.358-371, 2010.
DOI : 10.1109/TMM.2010.2050650
URL : https://hal.archives-ouvertes.fr/inria-00541412

V. Khalidov, F. Forbes, and R. Horaud, Conjugate Mixture Models for Clustering Multimodal Data, Neural Computation, vol.23, issue.2, pp.517-557, 2011.
DOI : 10.1162/NECO_a_00074
URL : https://hal.archives-ouvertes.fr/inria-00590267

V. Khalidov, F. Forbes, M. Hansard, E. Arnaud, and R. Horaud, Detection and localization of 3D audio-visual objects using unsupervised clustering, Proceedings of the 10th international conference on Multimodal interfaces, ICMI '08, 2008.
DOI : 10.1145/1452392.1452438
URL : https://hal.archives-ouvertes.fr/inria-00373148

X. Alameda-Pineda, V. Khalidov, R. Horaud, and F. Forbes, Finding audio-visual events in informal social gatherings, Proceedings of the 13th international conference on multimodal interfaces, ICMI '11, 2011.
DOI : 10.1145/2070481.2070527
URL : https://hal.archives-ouvertes.fr/inria-00623489

F. Forbes, S. Doyle, D. Garcia-Lorenzo, C. Barillot, and M. Dojat, A weighted multi-sequence Markov model for brain lesion segmentation, AISTATS, 2010.
URL : https://hal.archives-ouvertes.fr/inserm-00723808

A. Deleforge, V. Drouard, L. Girin, and R. Horaud, Mapping sounds on images using binaural spectrograms, EUSIPCO, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01019287

A. Noulas, G. Englebienne, and B. J. Kröse, Multimodal Speaker Diarization, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, issue.1, pp.79-93, 2012.
DOI : 10.1109/TPAMI.2011.47

V. Ferrari, M. Marin-Jimenez, and A. Zisserman, Progressive search space reduction for human pose estimation, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587468

X. Zhu and D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, CVPR, 2012.

C. Keribin, Consistent estimation of the order of mixture models, Sankhya Series A, vol.62, issue.1, pp.49-66, 2000.