Learning from narrated instruction videos

Jean-Baptiste Alayrac (1, 2, 3), Piotr Bojanowski (1, 2, 3), Nishant Agrawal (1, 2, 3), Josef Sivic (1, 2, 3), Ivan Laptev (1, 2, 3), Simon Lacoste-Julien (1, 3, 4)
2 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique de l'École normale supérieure, Inria de Paris
4 SIERRA - Statistical Machine Learning and Parsimony
DI-ENS - Département d'informatique de l'École normale supérieure, ENS Paris - École normale supérieure - Paris, CNRS - Centre National de la Recherche Scientifique, Inria de Paris
Abstract: Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding the visual and verbal content of a video. Towards this goal, we address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.
Document type:
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2017, XX

Cited literature: 41 references

https://hal.archives-ouvertes.fr/hal-01580630
Contributor: Jean-Baptiste Alayrac
Submitted on: Friday, September 1, 2017 - 19:22:48
Last modified on: Thursday, January 11, 2018 - 06:28:04
Document(s) archived on: Saturday, December 2, 2017 - 14:36:48

File

pami2016alayrac.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01580630, version 1

Citation

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, et al. Learning from narrated instruction videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2017, XX. 〈hal-01580630〉

Metrics

Record views

153

File downloads

137