Conference papers

VPN: Learning Video-Pose Embedding for Activities of Daily Living

Abstract : In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns that vary with time. As a result, different ADL may look very similar, and distinguishing them often requires examining their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space, which enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. To discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone that exploits the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
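To make the idea of pose-driven spatio-temporal attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, the linear projections, and the choice to form the joint weights as the product of a temporal and a spatial softmax are all illustrative assumptions. It shows only how attention weights computed from a pose feature vector could re-weight an RGB feature map before pooling.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def joint_attention(pose_feat, w_spatial, w_temporal, h, w):
    """Hypothetical coupler: map a pose feature vector to (T, H, W) weights.

    Assumption: the joint weights are the outer product of a temporal
    softmax and a spatial softmax, so they sum to 1 over the whole video.
    """
    spatial = softmax(w_spatial @ pose_feat).reshape(h, w)   # (H, W)
    temporal = softmax(w_temporal @ pose_feat)               # (T,)
    return temporal[:, None, None] * spatial[None, :, :]     # (T, H, W)

def attend(video_feat, attn):
    """Attention-weighted pooling of a (T, H, W, C) feature map to (C,)."""
    return np.einsum("thw,thwc->c", attn, video_feat)

# Toy sizes and random stand-ins for learned features and parameters.
rng = np.random.default_rng(0)
T, H, W, C, D = 4, 3, 3, 8, 16
video_feat = rng.normal(size=(T, H, W, C))   # stand-in for 3D ConvNet output
pose_feat = rng.normal(size=D)               # stand-in for pose backbone output
w_spatial = rng.normal(size=(H * W, D))      # hypothetical projection weights
w_temporal = rng.normal(size=(T, D))         # hypothetical projection weights

attn = joint_attention(pose_feat, w_spatial, w_temporal, H, W)
pooled = attend(video_feat, attn)
```

With all projection weights set to zero, the attention becomes uniform and `attend` reduces to plain average pooling over time and space, which is a convenient sanity check for the sketch.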

Cited literature : 63 references

Contributor : Srijan Das
Submitted on : Wednesday, October 21, 2020 - 11:49:10 AM
Last modification on : Thursday, January 20, 2022 - 5:31:12 PM
Long-term archiving on : Friday, January 22, 2021 - 6:37:24 PM


Files produced by the author(s)


  • HAL Id : hal-02973787, version 1



Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat. VPN: Learning Video-Pose Embedding for Activities of Daily Living. ECCV 2020 - 16th European Conference on Computer Vision, Aug 2020, Glasgow (Virtual), United Kingdom. ⟨hal-02973787⟩


