Conference papers

How2: A Large-scale Dataset for Multimodal Language Understanding

Abstract: Human information processing is inherently multimodal, and language is best understood in a situated context. To achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, rather than text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across the language, speech, and vision communities.
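The abstract describes clips paired with English subtitles, word-level time alignments, and crowdsourced Portuguese translations. As a rough illustration only (the field names, units, and identifiers below are assumptions, not the official How2 release format), one might represent a single clip like this in Python:

```python
# Hypothetical sketch, NOT the official How2 data loader or schema:
# one possible in-memory representation of a clip with its English subtitle,
# word-level time alignments, and a Portuguese translation.
from dataclasses import dataclass
from typing import List


@dataclass
class WordAlignment:
    word: str
    start: float  # assumed: seconds from the start of the clip
    end: float


@dataclass
class How2Clip:
    clip_id: str
    english_subtitle: str
    alignments: List[WordAlignment]  # word-level timing for the English subtitle
    portuguese_subtitle: str         # crowdsourced translation


# Example values are invented for illustration.
clip = How2Clip(
    clip_id="example_clip_0001",
    english_subtitle="cut the onion",
    alignments=[
        WordAlignment("cut", 0.00, 0.35),
        WordAlignment("the", 0.35, 0.48),
        WordAlignment("onion", 0.48, 1.02),
    ],
    portuguese_subtitle="corte a cebola",
)
print(clip.english_subtitle, "->", clip.portuguese_subtitle)
```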

https://hal.archives-ouvertes.fr/hal-02431947
Contributor: Loïc Barrault
Submitted on: Wednesday, January 8, 2020 - 11:37:15 AM
Last modified on: Tuesday, March 3, 2020 - 2:55:31 PM
Document(s) archived on: Friday, April 10, 2020 - 12:22:48 AM

File

26.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-02431947, version 1


Citation

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, et al. How2: A Large-scale Dataset for Multimodal Language Understanding. NeurIPS, 2018, Montréal, Canada. ⟨hal-02431947⟩


Metrics

Record views: 29
File downloads: 20