Learning Actionness via Long-range Temporal Order Verification

Abstract : Current methods for action recognition typically rely on supervision provided by manual labeling. Such methods, however, do not scale well given the high burden of manual video annotation and the very large number of possible actions. Annotation is particularly difficult for temporal action localization, where large parts of the video contain no action, only background. To address these challenges, we here propose a self-supervised and generic method to isolate actions from their background. We build on the observation that actions often follow a particular temporal order and, hence, can be predicted from other actions in the same video. As consecutive actions might be separated by minutes, unlike prior work on the arrow of time, we here exploit long-range temporal relations in videos that are 10-20 minutes long. To this end, we propose a new model that learns actionness via a self-supervised proxy task of order verification. The model assigns high actionness scores to clips whose order is easy to predict from other clips in the video. To obtain a powerful and action-agnostic model, we train it on the large-scale unlabeled HowTo100M dataset with highly diverse actions from instructional videos. We validate our method on the task of action localization and demonstrate consistent improvements when combined with other recent weakly-supervised methods.
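To make the proxy task concrete, the following is a minimal sketch of how order-verification training pairs could be drawn from a single video. It is an illustration only, not the authors' implementation: the function name `make_order_pairs`, the `min_gap` parameter, and the uniform sampling scheme are all assumptions, and the abstract does not describe the actual sampling procedure or model. Each sampled pair of clip indices is presented either in its true temporal order (label 1) or swapped (label 0), which is the binary signal an order-verification model would be trained to predict.

```python
import random
from typing import List, Tuple


def make_order_pairs(
    num_clips: int, num_pairs: int, min_gap: int = 1, seed: int = 0
) -> List[Tuple[int, int, int]]:
    """Sample (first_shown, second_shown, label) triples from one video.

    Hypothetical helper: label is 1 when the two clip indices are shown in
    their true temporal order and 0 when they are swapped. `min_gap` (an
    assumed parameter) forces the two clips to be far apart, mimicking
    long-range relations.
    """
    if num_clips - min_gap < 1:
        raise ValueError("video has too few clips for the requested gap")
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Pick an earlier clip i and a later clip j at least min_gap apart.
        i = rng.randrange(0, num_clips - min_gap)
        j = rng.randrange(i + min_gap, num_clips)
        if rng.random() < 0.5:
            pairs.append((i, j, 1))  # presented in the correct order
        else:
            pairs.append((j, i, 0))  # presented in the swapped order
    return pairs


if __name__ == "__main__":
    for first, second, label in make_order_pairs(num_clips=120, num_pairs=5, min_gap=30):
        print(first, second, label)
```

In such a setup, a clip that sits in pure background gives little evidence about where it falls in the video, so its order relative to other clips is hard to verify; this is the intuition behind using order predictability as an actionness signal.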
Document type : Conference papers
Contributor : Dimitri Zhukov
Submitted on : Wednesday, December 9, 2020 - 3:11:22 PM
Last modification on : Wednesday, June 8, 2022 - 12:50:06 PM
Long-term archiving on : Wednesday, March 10, 2021 - 7:26:37 PM


  • HAL Id : hal-03048753, version 1



Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic. Learning Actionness via Long-range Temporal Order Verification. ECCV 2020 - European Conference on Computer Vision, Aug 2020, Glasgow / Virtual, United Kingdom. ⟨hal-03048753⟩


