Conference paper, Year: 2022

AVATAR: Unconstrained Audiovisual Speech Recognition

Abstract

Audiovisual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that focus solely on lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background, etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
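
The word-masking idea is concrete enough to sketch. Below is a minimal Python/NumPy illustration of one plausible masking strategy, not the authors' implementation: it assumes word-level timestamps are available (e.g. from a forced aligner) and zeroes out the spectrogram frames of randomly chosen words, so those words can only be recovered by attending to the visual stream. The function name mask_word_audio and all parameter values are hypothetical.

    import random
    import numpy as np

    def mask_word_audio(spectrogram, word_intervals, mask_prob=0.3, frames_per_sec=100):
        # Hypothetical sketch of audio word masking (not the paper's code).
        # spectrogram: (T, F) array of audio features, T frames x F bins.
        # word_intervals: list of (start_sec, end_sec) per transcript word.
        # With probability mask_prob, zero the frames covering a word so the
        # model can only recover that word from the visual frames.
        masked = spectrogram.copy()
        for start, end in word_intervals:
            if random.random() < mask_prob:
                s = int(start * frames_per_sec)
                e = int(end * frames_per_sec)
                masked[s:e, :] = 0.0
        return masked

    # Toy usage: a 3-second clip at 100 frames/sec with 80 feature bins.
    spec = np.random.randn(300, 80)
    words = [(0.2, 0.8), (1.1, 1.9), (2.0, 2.7)]
    masked_spec = mask_word_audio(spec, words, mask_prob=0.5)

The masking rate and the source of word timings are tuning choices; the paper compares several such masking strategies rather than committing to a single one.
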
Main file
avatar_visspeech.pdf (10.57 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03717330, version 1 (08-07-2022)

Identifiers

  • HAL Id: hal-03717330, version 1

Cite

Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, et al. AVATAR: Unconstrained Audiovisual Speech Recognition. INTERSPEECH 2022 - Conference of the International Speech Communication Association, Sep 2022, Incheon, South Korea. pp. 1-6. ⟨hal-03717330⟩