VPN: Learning Video-Pose Embedding for Activities of Daily Living

Srijan Das; Saurav Sharma; Rui Dai; Francois F Bremond; Monique Thonnat

Conference paper, Year: 2020

Srijan Das

Role: Author
PersonId : 21855
IdHAL : srijan-das

Spatio-Temporal Activity Recognition Systems

Saurav Sharma

Role: Author

Spatio-Temporal Activity Recognition Systems

Rui Dai

Role: Author

Spatio-Temporal Activity Recognition Systems

Francois F Bremond

Role: Author
PersonId : 20805
IdHAL : francois-bremond
ORCID : 0000-0003-2988-2142
IdRef : 138919046

Spatio-Temporal Activity Recognition Systems

Monique Thonnat

Role: Author

Spatio-Temporal Activity Recognition Systems

Abstract

In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. As a result, different ADL may look very similar, and distinguishing them often requires looking at their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. To discriminate between similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
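
To make the two components named in the abstract more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a pose feature vector is already available (in place of the learnable pose backbone), derives joint spatio-temporal attention weights from it as the outer product of a temporal and a spatial softmax, and measures how far the attention-pooled video feature lies from the pose feature in a shared space. All function names, tensor shapes, the random projection matrices, and the squared Euclidean distance are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def joint_attention(pose_feat, w_spatial, w_temporal):
    """Hypothetical coupler: pose feature (D,) -> attention weights (T, N).

    w_spatial has shape (D, N) with N = H * W spatial locations,
    w_temporal has shape (D, T). The result sums to 1 over all (t, n).
    """
    spatial = softmax(pose_feat @ w_spatial)    # (N,)
    temporal = softmax(pose_feat @ w_temporal)  # (T,)
    return np.outer(temporal, spatial)          # (T, N)


def attend(video_feat, attn):
    """Attention-weighted pooling of a video feature map (T, N, C) -> (C,)."""
    return np.einsum("tn,tnc->c", attn, video_feat)


def embedding_distance(video_vec, pose_feat, w_embed):
    """Squared Euclidean distance between the pooled video feature (C,)
    and the pose feature projected into the same C-dimensional space."""
    diff = video_vec - pose_feat @ w_embed
    return float(diff @ diff)


# Toy sizes (assumed): T frames, H x W locations, C channels, D pose dims.
T, H, W, C, D = 4, 3, 3, 8, 16
rng = np.random.default_rng(0)
video_feat = rng.standard_normal((T, H * W, C))  # stand-in for 3D ConvNet output
pose_feat = rng.standard_normal(D)               # stand-in for pose backbone output
w_spatial = rng.standard_normal((D, H * W))
w_temporal = rng.standard_normal((D, T))
w_embed = rng.standard_normal((D, C))

attn = joint_attention(pose_feat, w_spatial, w_temporal)
pooled = attend(video_feat, attn)
distance = embedding_distance(pooled, pose_feat, w_embed)
```

In a trained model the projection matrices would be learned, and a distance of this kind could serve as a regularizer that pulls the two modalities together in the common space; here they are random, so the values only demonstrate shapes and data flow.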

Keywords

action recognition video pose embedding attention

Domains

Artificial Intelligence [cs.AI]

Main file

ECCV2020_camera_ready.pdf (1.58 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-02973787

Submitted on: Wednesday, October 21, 2020, 11:49:10

Last modified on: Wednesday, March 15, 2023, 08:58:09

Long-term archiving on: Friday, January 22, 2021, 18:37:24

Dates and versions

hal-02973787 , version 1 (21-10-2020)

Identifiers

HAL Id : hal-02973787 , version 1

Cite

Srijan Das, Saurav Sharma, Rui Dai, Francois F Bremond, Monique Thonnat. VPN: Learning Video-Pose Embedding for Activities of Daily Living. ECCV 2020 - 16th European Conference on Computer Vision, Aug 2020, Glasgow (Virtual), United Kingdom. ⟨hal-02973787⟩

Collections

INRIA INRIA2 UNIV-COTEDAZUR OPAL

333 Views

85 Downloads
