Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

Sanjeel Parekh; Alexey Ozerov; Slim Essid; Ngoc Q K Duong; Patrick Pérez; Gael Richard

Preprint, Working Paper. Year: 2018

Sanjeel Parekh

Role: Author
PersonId : 743443
IdHAL : sanjeel-parekh
IdRef : 235511277

Technicolor R & I [Cesson Sévigné]

Alexey Ozerov

Role: Author
PersonId : 882775

Technicolor R & I [Cesson Sévigné]

Slim Essid

Role: Author
PersonId : 181234
IdHAL : slimessid
ORCID : 0000-0002-0028-327X
IdRef : 11025130X

Laboratoire Traitement et Communication de l'Information

Ngoc Q K Duong

Role: Author
PersonId : 864978

Technicolor R & I [Cesson Sévigné]

Patrick Pérez

Role: Author
PersonId : 1022281

Valeo.ai

Gael Richard

Role: Author
PersonId : 14146
IdHAL : gael-richard
IdRef : 094977208

Laboratoire Traitement et Communication de l'Information

Abstract

We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.

Keywords

multiple instance learning; audio-visual event detection; non-negative matrix factorization; source separation

Domains

Signal and image processing [eess.SP]; Acoustics [physics.class-ph]; Computer vision and pattern recognition [cs.CV]; Neural networks [cs.NE]

Main file

manuscript.pdf (731.05 KB)

Origin: Files produced by the author(s)

Contributor: Alexey Ozerov

https://hal.science/hal-01626389

Submitted on: Thursday, May 17, 2018, 22:02:39

Last modified on: Monday, October 9, 2023, 12:49:40

Long-term archiving on: Tuesday, September 25, 2018, 16:17:50

Dates and versions

hal-01626389 , version 1 (30-10-2017)

hal-01626389 , version 2 (17-05-2018)

hal-01626389 , version 3 (06-11-2018)

Identifiers

HAL Id : hal-01626389 , version 2

Cite

Sanjeel Parekh, Alexey Ozerov, Slim Essid, Ngoc Q K Duong, Patrick Pérez, et al. Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision. 2018. ⟨hal-01626389v2⟩

Collections

INSTITUT-TELECOM CNRS

832 views

2230 downloads
