C. Barras, X. Zhu, S. Meignier, and J. L. Gauvain, Multistage speaker diarization of broadcast news, IEEE Transactions on Audio, Speech and Language Processing, vol.14, issue.5, pp.1505-1512, 2006.
DOI : 10.1109/TASL.2006.878261
URL : https://hal.archives-ouvertes.fr/hal-01434241

M. Bäuml, M. Tapaswi, and R. Stiefelhagen, Semisupervised Learning with Constraints for Person Identification in Multimedia Data, International Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

J. Bergstra and Y. Bengio, Random Search for Hyper-Parameter Optimization, Journal of Machine Learning Research, vol.13, pp.281-305, 2012.

V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, vol.2008, issue.10, 2008.
DOI : 10.1088/1742-5468/2008/10/P10008
URL : https://hal.archives-ouvertes.fr/hal-01146070

H. Bredin and G. Chollet, Audio-Visual Speech Synchrony Measure: Application to, Special Issue on Knowledge-Assisted Media Analysis for Interactive Multimedia Applications, 2007.
DOI : 10.1155/2007/70186
URL : https://doi.org/10.1155/2007/70186

H. Bredin and J. Poignant, Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast, Interspeech 2013, 14th Annual Conference of the International Speech Communication Association, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00953095

L. Canseco, L. Lamel, and J. L. Gauvain, A comparative study using manual and automatic transcriptions for diarization, IEEE Workshop on Automatic Speech Recognition and Understanding, pp.415-419, 2005.
DOI : 10.1109/ASRU.2005.1566507
URL : https://www.lrde.epita.fr/~reda/cours/speech/speakerDiarization/1566507.pdf

S. S. Chen and P. Gopalakrishnan, Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion, DARPA Broadcast News Transcription and Understanding Workshop, 1998.

T. Cour, B. Sapp, A. Nagle, and B. Taskar, Talking pictures: Temporal grouping and dialog-supervised person recognition, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5540106
URL : http://www.seas.upenn.edu/%7Etimothee/papers/cvpr_2010.pdf

N. Dimitrova, H. J. Zhang, B. Shahraray, I. Sezan, T. Huang et al., Applications of video-content analysis and retrieval, IEEE Multimedia, vol.9, issue.3, pp.42-55, 2002.
DOI : 10.1109/MMUL.2002.1022858

M. Dinarelli and S. Rosset, Models Cascade for Tree-Structured Named Entity Detection, Proceedings of 5th International Joint Conference on Natural Language Processing, Asian Federation of Natural Language Processing, pp.1269-1278, 2011.

G. Dupuy, M. Rouvier, S. Meignier, and Y. Estève, i-Vectors and ILP Clustering Adapted to Cross-Show Speaker Diarization, Interspeech 2012, 13th Annual Conference of the International Speech Communication Association, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01450711

Y. Estève, S. Meignier, P. Deléglise, and J. Mauclair, Extracting true speaker identities from transcriptions, Proceedings of Interspeech, pp.2601-2604, 2007.

J. R. Finkel and C. D. Manning, Enforcing transitivity in coreference resolution, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies Short Papers, HLT '08, 2008.
DOI : 10.3115/1557690.1557703
URL : http://nlp.stanford.edu/cmanning/papers/acl08_coref_ilp_final.pdf

J. G. Fiscus, J. S. Garofolo, A. N. Le, A. F. Martin, D. S. Pallett et al., Results of the Fall 2004 STT and MDE Evaluation, Rich Transcription Workshop, 2004.

J. L. Gauvain, L. Lamel, and G. Adda, Partitioning and Transcription of Broadcast News Data, Proceedings of International Conference on Spoken Language Processing (ICSLP 98), pp.1335-1338, 1998.

J. L. Gauvain, L. Lamel, and G. Adda, The LIMSI Broadcast News transcription system, Speech Communication, vol.37, issue.1-2, pp.89-109, 2002.
DOI : 10.1016/S0167-6393(01)00061-9
URL : https://hal.archives-ouvertes.fr/hal-01434493

J. L. Gauvain and C. H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Transactions on Speech and Audio Processing, vol.2, issue.2, pp.291-298, 1994.
DOI : 10.1109/89.279278

A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert et al., The REPERE Corpus: a Multimodal Corpus for Person Recognition, International Conference on Language Resources and Evaluation (LREC), 2012.

G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel et al., The ETAPE Corpus for the Evaluation of Speech-based TV Content processing in the French language, International Conference on Language Resources, Evaluation and Corpora, Turkey, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00712591

Gurobi Optimization, Inc., Gurobi Optimizer Reference Manual, 2012.

H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America, vol.87, issue.4, pp.1738-1752, 1990.
DOI : 10.1121/1.399423

A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: a review, ACM Computing Surveys, vol.31, issue.3, pp.264-323, 1999.
DOI : 10.1145/331499.331504

V. Jousse, S. Petitrenaud, S. Meignier, Y. Estève, and C. Jacquin, Automatic named identification of speakers using diarization and ASR systems, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
DOI : 10.1109/ICASSP.2009.4960644
URL : https://hal.archives-ouvertes.fr/hal-00412431

J. Lawto, J. L. Gauvain, L. Lamel, G. Grefenstette, G. Gravier et al., A Scalable Video Search Engine Based on Audio Content Indexing and Topic Segmentation, Networked and Electronic Media (NEM) Summit: Implementing Future Media Internet, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00645228

V. B. Le, C. Barras, and M. Ferras, On the use of GSV-SVM for Speaker Diarization and Tracking, Proceedings of Odyssey 2010 - The Speaker and Language Recognition Workshop, pp.146-150, 2010.

J. Mauclair, S. Meignier, and Y. Estève, Speaker Diarization: About whom the Speaker is Talking?, 2006 IEEE Odyssey, The Speaker and Language Recognition Workshop, 2006.
DOI : 10.1109/ODYSSEY.2006.248114
URL : https://hal.archives-ouvertes.fr/hal-01434121

S. Mouysset, J. Noailles, D. Ruiz, and R. Guivarch, On a Strategy for Spectral Clustering with Parallel Computation, High Performance Computing for Computational Science - VECPAR, pp.408-420, 2010.

M. E. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences, vol.103, issue.23, pp.8577-8582, 2006.
DOI : 10.1073/pnas.0601602103
URL : http://www.pnas.org/content/103/23/8577.full.pdf

J. Y. Pan, H. J. Yang, and C. Faloutsos, MMSS: Multi-modal Story-oriented Video Summarization, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM), 2004.

J. Y. Pan, H. J. Yang, C. Faloutsos, and P. Duygulu, Automatic multimedia cross-modal correlation discovery, Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04, 2004.
DOI : 10.1145/1014052.1014135
URL : http://www.cs.bilkent.edu.tr/%7Eduygulu/papers/KDD2004.pdf

J. Pelecanos and S. Sridharan, Feature Warping for Robust Speaker Verification, Proceedings of Odyssey 2001 - The Speaker Recognition Workshop, pp.213-218, 2001.

D. Pelleg and A. W. Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters, Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pp.727-734, 2000.

J. Poignant, L. Besacier, V. B. Le, S. Rosset, and G. Quénot, Unsupervised Naming of Speakers in Broadcast TV: using Written Names, Pronounced Names or Both?, Interspeech 2013, 14th Annual Conference of the International Speech Communication Association, 2013.

J. Poignant, L. Besacier, G. Quénot, and F. Thollard, From Text Detection in Videos to Person Identification, 2012 IEEE International Conference on Multimedia and Expo, 2012.
DOI : 10.1109/ICME.2012.119
URL : https://hal.archives-ouvertes.fr/hal-00767383

J. Poignant, H. Bredin, V. B. Le, L. Besacier, C. Barras et al., Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast, Interspeech 2012, 13th Annual Conference of the International Speech Communication Association, Portland, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00767427

D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing, vol.10, issue.1-3, pp.19-41, 2000.
DOI : 10.1006/dspr.1999.0361
URL : http://www.cse.ohio-state.edu/~dwang/teaching/cse788/papers/Reynolds-dsp00.pdf

A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.22, issue.12, pp.1349-1380, 2000.
DOI : 10.1109/34.895972

R. Smith, An Overview of the Tesseract OCR Engine, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, pp.629-633, 2007.
DOI : 10.1109/ICDAR.2007.4376991

S. E. Tranter, Who Really Spoke When? Finding Speaker Turns and Identities in Broadcast News Audio, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, pp.1013-1016, 2006.
DOI : 10.1109/ICASSP.2006.1660195
URL : http://mi.eng.cam.ac.uk/reports/svr-ftp/tranter_icassp06.pdf

Y. Wang, Z. Liu, and J. C. Huang, Multimedia content analysis-using both audio and visual clues, IEEE Signal Processing Magazine, vol.17, issue.6, pp.12-36, 2000.
DOI : 10.1109/79.888862