Masking Modalities for Cross-modal Video Retrieval

Valentin Gabeur; Arsha Nagrani; Chen Sun; Karteek Alahari; Cordelia Schmid

Conference paper, Year: 2022



Valentin Gabeur

Role: Author

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Google Inc.

Arsha Nagrani

Role: Author

Google Inc.

Chen Sun

Role: Author

Google Inc.

Karteek Alahari

Role: Author
PersonId: 19670
IdHAL: karteek
ORCID: 0000-0002-1838-5936
IdRef: 196283892

Apprentissage de modèles à partir de données massives (Learning models from massive data)

Cordelia Schmid

Role: Author

Google Inc.

Abstract

Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our 'modality masking' pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
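
The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature shapes, the averaging "encoder", the InfoNCE-style loss, and every function name below (`mask_modality`, `predict_masked`, `contrastive_loss`) are assumptions made for illustration only. The abstract states only that one entire modality is masked and predicted from the other two.

```python
import numpy as np

MODALITIES = ("appearance", "audio", "speech")

def mask_modality(features, masked):
    """Zero out one entire modality; return the masked inputs and the target."""
    if masked not in features:
        raise KeyError(masked)
    inputs = {name: (np.zeros_like(x) if name == masked else x.copy())
              for name, x in features.items()}
    return inputs, features[masked]

def predict_masked(inputs, masked):
    """Hypothetical stand-in for a video encoder: average the visible modalities."""
    visible = [x for name, x in inputs.items() if name != masked]
    return np.mean(visible, axis=0)

def contrastive_loss(pred, target, temperature=0.1):
    """InfoNCE-style loss (an assumption): each prediction should match its own target."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: 4 clips, one 8-dimensional feature vector per modality (random placeholders).
rng = np.random.default_rng(0)
batch, dim = 4, 8
features = {name: rng.normal(size=(batch, dim)) for name in MODALITIES}

# Mask each modality in turn and score its prediction from the other two.
losses = {}
for masked in MODALITIES:
    inputs, target = mask_modality(features, masked)
    pred = predict_masked(inputs, masked)
    losses[masked] = contrastive_loss(pred, target)
```

Cycling through all three modalities means each one serves as the prediction target once and as an input twice, which is the property the abstract highlights: speech is seen by the encoder rather than used only as supervision.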

Domains

Computer Vision and Pattern Recognition [cs.CV]

Main file

MMCVR.pdf (5.23 MB)

Origin: Files produced by the author(s)

Contributor: THOTH Team

https://hal.science/hal-03420133

Submitted on: Tuesday, November 9, 2021, 07:54:02

Last modified on: Thursday, April 4, 2024, 18:17:38

Long-term archiving on: Thursday, February 10, 2022, 18:11:35

Dates and versions

hal-03420133, version 1 (09-11-2021)

Identifiers

HAL Id: hal-03420133, version 1
arXiv: 2111.01300

Cite

Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid. Masking Modalities for Cross-modal Video Retrieval. WACV 2022 - Winter Conference on Applications of Computer Vision, Jan 2022, Waikoloa, United States. pp.1-10. ⟨hal-03420133⟩


Collections

UGA CNRS INRIA LJK LJK_GI INRIA2 LJK-GI-THOTH ANR

