C. S. De-almeida, J. Cousty, B. Perret, Z. K. Do-patrocínio, and S. J. Guimarães, Label propagation guided by hierarchy of partitions for superpixel computation, Image Analysis and Processing - ICIAP 2019 - 20th International Conference, vol.11752, pp.3-13, 2019.

P. K. Atrey, M. A. Hossain, A. El-saddik, and M. S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimedia systems, vol.16, issue.6, pp.345-379, 2010.

M. Azab, M. Wang, M. Smith, N. Kojima, J. Deng et al., Speaker naming in movies, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.2206-2216, 2018.

F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre et al., Multimodal understanding for person recognition in video broadcasts, International Conference on Spoken Language Processing (ICSLP), pp.607-611, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01194244

M. Ben, M. Betser, F. Bimbot, and G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, Proceedings of the 8th International Conference on Spoken Language Processing, pp.333-444, 2004.

E. A. Bernal, X. Yang, Q. Li, J. Kumar, S. Madhvanath et al., Deep temporal multimodal fusion for medical procedure monitoring using wearable sensors, IEEE Transactions on Multimedia, vol.20, issue.1, pp.107-118, 2017.

H. Bredin, C. Barras, and C. Guinaudeau, Multimodal person discovery in broadcast TV at MediaEval, Working notes of the MediaEval 2016 Workshop, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01690330

H. Bredin, A. Roy, V. B. Le, and C. Barras, Person Instance Graphs for Mono-, Cross- and Multi-Modal Person Recognition in Multimedia Data. Application to Speaker Identification in TV Broadcast, International Journal of Multimedia Information Retrieval, 2014.

L. Canseco, L. Lamel, and J. L. Gauvain, A comparative study using manual and automatic transcriptions for diarization, IEEE Workshop on Automatic Speech Recognition and Understanding, pp.415-419, 2005.

L. Canseco-rodriguez, L. Lamel, and J. L. Gauvain, Speaker diarization from speech transcripts, International Conference on Spoken Language Processing (ICSLP), pp.1272-1275, 2004.

E. Cayllahua-cahuina, J. Cousty, S. J. Guimarães, Y. Kenmochi, G. Cámara-chávez et al., Hierarchical segmentation from a non-increasing edge observation attribute, Pattern Recognition Letters, vol.131, pp.105-112, 2020.

D. Chen and J. M. Odobez, Video text recognition using sequential Monte Carlo and error voting methods, Pattern Recognition Letters, vol.26, issue.9, pp.1386-1403, 2005.

J. Cousty, L. Najman, Y. Kenmochi, and S. Guimarães, Hierarchical segmentations with graphs: Quasi-flat zones, minimum spanning trees, and saliency maps, Journal of Mathematical Imaging and Vision, vol.60, issue.4, pp.479-502, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01344727

G. B. Da-fonseca, I. L. Freire, Z. Patrocínio, S. J. Guimarães, G. Sargent et al., Tag propagation approaches within speaking face graphs for multimodal person discovery, Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (CBMI), p.15, 2017.

N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.1, pp.886-893, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

M. Danelljan, G. Häger, F. Shahbaz-khan, and M. Felsberg, Accurate scale estimation for robust visual tracking, Proceedings of the British Machine Vision Conference, 2014.

B. V. Dasarathy, Decision fusion, 1994.

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.4, pp.788-798, 2011.

Y. Estève, S. Meignier, P. Deléglise, and J. Mauclair, Extracting true speaker identities from transcriptions, International Conference on Spoken Language Processing (ICSLP), pp.2601-2604, 2007.

O. Galibert and J. Kahn, The first official repere evaluation, First Workshop on Speech, Language and Audio for Multimedia, 2013.

D. Garcia-romero and C. Y. Espy-wilson, Analysis of i-vector length normalization in speaker recognition systems, 12th Annual Conference of the International Speech Communication Association, 2011.

P. Gay, G. Dupuy, C. Lailler, J. M. Odobez, S. Meignier et al., Comparison of two methods for unsupervised person identification in tv shows, 12th International Workshop on Content-Based Multimedia Indexing (CBMI), pp.1-6, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433260

J. Geng, Z. Miao, and X. P. Zhang, Efficient heuristic methods for multimodal fusion and concept fusion in video concept detection, IEEE Transactions on Multimedia, vol.17, issue.4, pp.498-511, 2015.

R. Houghton, Named faces: putting names to faces, IEEE Intelligent Systems and their Applications, vol.14, issue.5, pp.45-50, 1999.

Y. Hu, J. S. Ren, J. Dai, C. Yuan, L. Xu et al., Deep multimodal speaker naming, Proceedings of the 23rd ACM International Conference on Multimedia, pp.1107-1110, 2015.

J. Kahn, O. Galibert, L. Quintard, M. Carré, A. Giraudel et al., A presentation of the repere challenge, 10th International Workshop on Content-Based Multimedia Indexing (CBMI), pp.1-6, 2012.

E. Kakaletsis, O. Zoidi, I. Tsingalis, A. Tefas, N. Nikolaidis et al., Fast constrained person identity label propagation in stereo videos using a pruned similarity matrix, Signal Processing: Image Communication, vol.67, pp.199-209, 2018.

D. Lahat, T. Adali, and C. Jutten, Multimodal data fusion: An overview of methods, challenges, and prospects, Proceedings of the IEEE, vol.103, issue.9, pp.1449-1477, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01179853

J. R. Landis and G. G. Koch, The measurement of observer agreement for categorical data, Biometrics, vol.33, issue.1, pp.159-174, 1977.

N. Le, H. Bredin, G. Sargent, P. Lopez-otero, C. Barras et al., Towards large scale multimedia indexing: A case study on person discovery in broadcast news, Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (CBMI), p.18, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01551690

N. Le, S. Meignier, and J. M. Odobez, Eumssi team at the mediaeval person discovery challenge, Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01433209

Z. Ma, F. Nie, Y. Yang, J. R. Uijlings, N. Sebe et al., Discriminating joint feature analysis for multimedia data understanding, IEEE Transactions on Multimedia, vol.14, issue.6, pp.1662-1672, 2012.

G. Martí, C. Cortillas, G. Bouritsas, E. Sayrol, J. R. Morros et al., Upc system for the 2016 mediaeval multimodal person discovery in broadcast tv task, Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.

N. Masuda, M. A. Porter, and R. Lambiotte, Random walks and diffusion on networks, Physics Reports, pp.1-58, 2017.

J. Mauclair, S. Meignier, and Y. Estève, Speaker diarization: About whom the speaker is talking?, IEEE Odyssey - The Speaker and Language Recognition Workshop, pp.1-6, 2006.

L. Najman and M. Couprie, Building the component tree in quasi-linear time, IEEE Transactions on Image Processing, vol.15, issue.11, pp.3531-3539, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00622110

V. T. Nguyen, M. T. Nguyen, Q. H. Che, V. T. Ninh, T. K. Le et al., Hcmus team at the multimodal person discovery in broadcast tv task of mediaeval, Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.

F. Nishi, N. Inoue, K. Iwano, and K. Shinoda, Tokyo tech at mediaeval 2016 multimodal person discovery in broadcast tv task, Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
URL : https://hal.archives-ouvertes.fr/hal-00911179

P. L. Otero, L. Docio-fernandez, and C. G. Mateo, Gtm-uvigo system for multimodal person discovery in broadcast tv task at mediaeval, Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.

L. Pang and C. W. Ngo, Unsupervised celebrity face naming in web videos, IEEE Transactions on Multimedia, vol.17, issue.6, pp.854-866, 2015.

B. Perret, J. Cousty, S. J. Guimarães, and D. S. Maia, Evaluation of hierarchical watersheds, IEEE Trans. Image Processing, vol.27, issue.4, pp.1676-1688, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01430865

B. Perret, J. Cousty, J. C. Ura, and S. J. Guimarães, Evaluation of morphological hierarchies for supervised segmentation, Proceedings of the 12th International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing, pp.39-50, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01142072

P. T. Pham, M. Moens, and T. Tuytelaars, Cross-media alignment of names and faces, IEEE Transactions on Multimedia, vol.12, issue.1, pp.13-27, 2010.

S. Pini, M. Cornia, F. Bolelli, L. Baraldi, and R. Cucchiara, M-vad names: a dataset for video captioning with naming, Multimedia Tools and Applications, vol.78, issue.10, p.27, 2019.

J. Poignant, L. Besacier, and G. Quénot, Unsupervised speaker identification in tv broadcast based on written names, IEEE Transactions on Audio, Speech, and Language Processing, vol.23, issue.1, pp.57-68, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01060827

J. Poignant, H. Bredin, and C. Barras, Multimodal person discovery in broadcast TV at mediaeval, Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01690332

J. Poignant, H. Bredin, and C. Barras, Multimodal person discovery in broadcast tv: lessons learned from mediaeval 2015, Multimedia Tools and Applications, vol.76, issue.21, pp.547-569, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01690581

J. Poignant, G. Fortier, L. Besacier, and G. Quénot, Naming multi-modal clusters to identify persons in TV broadcast, Multimedia Tools and Applications, vol.75, issue.15, pp.8999-9023, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01230628

C. Raymond, Robust tree-structured named entities recognition from speech, International Conference on Acoustics, Speech and Signal Processing, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00830142

A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, Visual instance retrieval with deep convolutional networks, ITE Transactions on Media Technology and Applications, vol.4, issue.3, pp.251-258, 2016.

A. Rohrbach, M. Rohrbach, S. Tang, S. Joon-oh, and B. Schiele, Generating descriptions with grounded and co-referenced people, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4979-4989, 2017.

M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin et al., An open-source state of the art toolbox for broadcast news diarization, pp.25-29, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01433449

J. Sang and C. Xu, Robust face-name graph matching for movie character identification, IEEE Transactions on Multimedia, vol.14, issue.3, pp.586-596, 2012.

C. E. Dos Santos Jr., G. Gravier, and W. R. Schwartz, SSIG and IRISA at Multimodal Person Discovery, Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01196171

S. Satoh, Y. Nakamura, and T. Kanade, Name-it: naming and detecting faces in news videos, IEEE MultiMedia, vol.6, issue.1, pp.22-35, 1999.

F. Schroff, D. Kalenichenko, and J. Philbin, Facenet: A unified embedding for face recognition and clustering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.815-823, 2015.

R. Sicre, J. Rabin, Y. Avrithis, T. Furon, F. Jurie et al., Automatic discovery of discriminative parts as a quadratic assignment problem, Proceedings of the IEEE International Conference on Computer Vision, pp.1059-1068, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02370324

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.

K. Somandepalli, N. Kumar, T. Guha, and S. S. Narayanan, Unsupervised discovery of character dictionaries in animation movies, IEEE Transactions on Multimedia, vol.20, issue.3, pp.539-551, 2018.

G. Tolias, R. Sicre, and H. Jégou, Particular object retrieval with integral max-pooling of cnn activations, International Conference on Learning Representations (ICLR), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01842218

S. E. Tranter, Who really spoke when? finding speaker turns and identities in broadcast news audio, IEEE ICASSP, vol.1, 2006.

T. Tuytelaars and M. F. Moens, Naming people in news videos with label propagation, IEEE Multimedia, vol.18, issue.3, pp.44-55, 2011.

F. Vallet, S. Essid, and J. Carrive, A multimodal approach to speaker diarization on tv talk-shows, IEEE Transactions on Multimedia, vol.15, issue.3, pp.509-520, 2013.

J. Wu, S. Zhao, V. S. Sheng, J. Zhang, C. Ye et al., Weak-labeled active learning with conditional label dependence for multilabel image classification, IEEE Transactions on Multimedia, vol.19, issue.6, pp.1156-1169, 2017.

C. Xiong, G. Gao, Z. Zha, S. Yan, H. Ma et al., Adaptive learning for celebrity identification with video context, IEEE Transactions on Multimedia, vol.16, issue.5, pp.1473-1485, 2014.

J. Yang and A. G. Hauptmann, Naming every individual in news video monologues, Proceedings of the 12th ACM International Conference on Multimedia, pp.580-587, 2004.

J. Yang, R. Yan, and A. G. Hauptmann, Multiple instance learning for labeling faces in broadcasting news video, Proceedings of the 13th ACM International Conference on Multimedia, pp.31-40, 2005.

H. Yu, F. He, and Y. Pan, A novel region-based active contour model via local patch similarity measure for image segmentation, Multimedia Tools and Applications, vol.77, issue.18, pp.97-121, 2018.

H. Yu, F. He, and Y. Pan, A novel segmentation model for medical images with intensity inhomogeneity based on adaptive perturbation, Multimedia Tools and Applications, vol.78, issue.9, pp.779-790, 2019.

H. Yu, F. He, and Y. Pan, A scalable region-based level set method using adaptive bilateral filter for noisy image segmentation, Multimedia Tools and Applications, vol.79, issue.9, pp.5743-5765, 2020.

X. Zhang, L. Zhang, X. J. Wang, and H. Y. Shum, Finding celebrities in billions of web images, IEEE Transactions on Multimedia, vol.14, issue.4, pp.995-1007, 2012.

Y. Zhang, Z. Tang, B. Wu, Q. Ji, and H. Lu, A coupled hidden conditional random field model for simultaneous face clustering and naming in videos, IEEE Transactions on Image Processing, vol.25, issue.12, pp.5780-5792, 2016.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, Learning with local and global consistency, Advances in neural information processing systems, pp.321-328, 2004.

X. J. Zhu, Semi-supervised learning literature survey, vol.2, 2008.

O. Zoidi, A. Tefas, N. Nikolaidis, and I. Pitas, Person identity label propagation in stereo videos, IEEE Transactions on Multimedia, vol.16, issue.5, pp.1358-1368, 2014.