Learning from narrated instruction videos

Jean-Baptiste Alayrac (1, 2, 3), Piotr Bojanowski (1, 2, 3), Nishant Agrawal (1, 2, 3), Josef Sivic (1, 2, 3), Ivan Laptev (1, 2, 3), Simon Lacoste-Julien (1, 3, 4)
2 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique de l'École normale supérieure, Inria de Paris
4 SIERRA - Statistical Machine Learning and Parsimony
DI-ENS - Département d'informatique de l'École normale supérieure, CNRS - Centre National de la Recherche Scientifique, Inria de Paris
Abstract: Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding the visual and verbal content of a video. Towards this goal, we address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new, challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn, and localize the main steps of a task in input videos.
Cited literature: 39 references

https://hal.archives-ouvertes.fr/hal-01580630
Contributor: Jean-Baptiste Alayrac
Submitted on: Friday, September 1, 2017 - 7:22:48 PM
Last modification on: Thursday, February 7, 2019 - 3:49:19 PM
Document(s) archived on: Saturday, December 2, 2017 - 2:36:48 PM

File

pami2016alayrac.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01580630, version 1

Citation

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, et al. Learning from narrated instruction videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2017, XX. ⟨hal-01580630⟩
