Journal articles

Learning from narrated instruction videos

Jean-Baptiste Alayrac (1, 2, 3), Piotr Bojanowski (1, 2, 3), Nishant Agrawal (1, 2, 3), Josef Sivic (1, 2, 3), Ivan Laptev (1, 2, 3), Simon Lacoste-Julien (1, 3, 4)
2 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique de l'École normale supérieure, Inria de Paris
4 SIERRA - Statistical Machine Learning and Parsimony
DI-ENS - Département d'informatique de l'École normale supérieure, CNRS - Centre National de la Recherche Scientifique, Inria de Paris
Abstract: Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding the visual and verbal content of a video. Towards this goal, we address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.

Cited literature: 39 references
Contributor: Jean-Baptiste Alayrac
Submitted on: Friday, September 1, 2017 - 7:22:48 PM
Last modification on: Tuesday, May 4, 2021 - 2:06:03 PM
Long-term archiving on: Saturday, December 2, 2017 - 2:36:48 PM


Files produced by the author(s)


  • HAL Id: hal-01580630, version 1



Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, et al. Learning from narrated instruction videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2017, XX. ⟨hal-01580630⟩


