Conference papers

How2: A Large-scale Dataset for Multimodal Language Understanding

Abstract: Human information processing is inherently multimodal, and language is best understood in a situated context. To achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, rather than text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of instructional videos covering a wide variety of topics across 80,000 clips (about 2,000 hours), with word-level time alignments to the ground-truth English subtitles. In addition to being multimodal, How2 is multilingual: we crowdsourced Portuguese translations of the subtitles. We present results for monomodal and multimodal baselines on several language processing tasks with interesting insights on the utility of different modalities. We hope that by making the How2 dataset and baselines available we will encourage collaboration across the language, speech, and vision communities.
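The abstract describes clips paired with English subtitles, word-level time alignments, and crowdsourced Portuguese translations. As a rough illustration only (the field names, units, and identifiers below are assumptions, not the official How2 release format), one might represent a single clip like this in Python:

```python
# Hypothetical sketch, NOT the official How2 data loader or schema:
# one possible in-memory representation of a clip with its English subtitle,
# word-level time alignments, and a Portuguese translation.
from dataclasses import dataclass
from typing import List


@dataclass
class WordAlignment:
    word: str
    start: float  # assumed: seconds from the start of the clip
    end: float


@dataclass
class How2Clip:
    clip_id: str
    english_subtitle: str
    alignments: List[WordAlignment]  # word-level timing for the English subtitle
    portuguese_subtitle: str         # crowdsourced translation


# Example values are invented for illustration.
clip = How2Clip(
    clip_id="example_clip_0001",
    english_subtitle="cut the onion",
    alignments=[
        WordAlignment("cut", 0.00, 0.35),
        WordAlignment("the", 0.35, 0.48),
        WordAlignment("onion", 0.48, 1.02),
    ],
    portuguese_subtitle="corte a cebola",
)
print(clip.english_subtitle, "->", clip.portuguese_subtitle)
```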

https://hal.archives-ouvertes.fr/hal-02431947
Contributor: Loïc Barrault
Submitted on: Wednesday, January 8, 2020 - 11:37:15 AM
Last modified on: Tuesday, March 3, 2020 - 2:55:31 PM
Document(s) archived on: Friday, April 10, 2020 - 12:22:48 AM

File

26.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-02431947, version 1


Citation

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, et al. How2: A Large-scale Dataset for Multimodal Language Understanding. NeurIPS, 2018, Montréal, Canada. ⟨hal-02431947⟩


Metrics

Record views: 29
File downloads: 20