Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

Fabien Baradel; Christian Wolf; Julien Mille

Preprint, Working Paper. Year: 2017


Fabien Baradel

Role: Author
PersonId : 14882
IdHAL : fabien-baradel
IdRef : 253130204

Extraction de Caractéristiques et Identification

Christian Wolf

Role: Author
PersonId : 3860
IdHAL : christian-wolf
ORCID : 0000-0001-9766-3211
IdRef : 083311696

Extraction de Caractéristiques et Identification

Julien Mille

Role: Author
PersonId : 7702
IdHAL : julien-mille
IdRef : 167739980

Extraction de Caractéristiques et Identification

Abstract

We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features at a set of predefined locations specified by the pose stream, namely the four hands of the two people involved in the activity. Appearance features give important cues about hand motion and about objects held in each hand. We show that it is beneficial to shift the attention to different hands at different time steps depending on the activity itself. Finally, a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on three datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to the state of the art is achieved on the smaller MSR Daily Activity 3D dataset.
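The soft-attention idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how attention scores are computed, so dot-product scoring, the function names (`softmax`, `attend`), and all toy values below are assumptions made purely for illustration. The same pooling routine is applied first over the four hand glimpses (spatial attention conditioned on a pose-derived vector) and then over a short sequence of per-step features (temporal attention).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(features, query):
    """Soft attention: weight each feature vector by softmax(feature . query).

    Dot-product scoring is an assumption of this sketch, not taken from the paper.
    Returns the weighted sum of the feature vectors and the attention weights.
    """
    weights = softmax([dot(f, query) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights

# Hypothetical toy data: 4 hand glimpses, each a 3-d appearance feature.
hand_glimpses = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
# Stand-in for a conditioning vector derived from the pose stream.
pose_query = [2.0, 0.0, 0.0]

# Spatial attention: which hands to look at in this frame.
pooled, spatial_weights = attend(hand_glimpses, pose_query)

# Temporal attention: fuse per-step features (stand-ins for LSTM outputs).
step_features = [pooled, [0.0, 0.5, 0.5], [0.2, 0.2, 0.6]]
temporal_query = [1.0, 0.0, 0.0]  # hypothetical learned vector
fused, temporal_weights = attend(step_features, temporal_query)
```

In this toy setting, hands whose features align with the pose-derived vector receive larger weights, which mirrors the abstract's point that attention shifts between hands depending on the activity. In a trained model the scoring function and both query vectors would be learned rather than fixed by hand.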

Domains

Computer Vision and Pattern Recognition [cs.CV]

Contributor: Christian Wolf

https://hal.science/hal-01593548

Submitted on: Tuesday, September 26, 2017, 14:13:55

Last modified on: Wednesday, July 5, 2023, 15:28:04

Dates and versions

hal-01593548 , version 1 (26-09-2017)

Identifiers

HAL Id : hal-01593548 , version 1
ARXIV : 1703.10106

Cite

Fabien Baradel, Christian Wolf, Julien Mille. Pose-conditioned Spatio-Temporal Attention for Human Action Recognition. 2017. ⟨hal-01593548⟩


Collections

CNRS UNIV-LYON1 UNIV-LYON2 INSA-LYON EC-LYON LIRIS LABEXIMU INSA-GROUPE UDL

