DNN Driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Mandar Gogate; Ahsan Adeel; Ricard Marxer; Jon Barker; Amir Hussain

doi:10.21437/Interspeech.2018-2516

Conference paper, Year: 2018


Mandar Gogate

Role: Author

Birla Institute of Technology and Science

University of Stirling

Ahsan Adeel

Role: Author

University of Stirling

Ricard Marxer

Role: Author
PersonId : 19391
IdHAL : ricard-marxer
ORCID : 0000-0001-5099-5059
IdRef : 240437713

University of Sheffield [Sheffield]

Laboratoire d'Informatique et des Systèmes (LIS) (Marseille, Toulon)

DYNamiques de l’Information

Jon Barker

Role: Author
PersonId : 895549

Department of Computer Sciences [Sheffield]

Amir Hussain

Role: Author

University of Stirling

Abstract

The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audiovisual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited that leverages the complementary strengths of a stacked long short-term memory (LSTM) network and a convolutional LSTM network. Comparative simulation results in terms of speech quality and intelligibility show that the proposed AV mask estimation model significantly outperforms audio-only and visual-only mask estimation approaches in both speaker-dependent and speaker-independent scenarios.
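As a rough illustration of the ideal binary mask (IBM) named in the abstract, the sketch below marks each time-frequency unit as 1 when the local speech-to-noise ratio exceeds a threshold and 0 otherwise. This is the textbook IBM definition, not code from the paper: the function names, the toy power values, and the 0 dB local criterion are all assumptions made for illustration, and the abstract does not state which threshold or spectral representation the authors used.

```python
import math


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Return a binary mask with 1 where the local SNR exceeds lc_db.

    speech_power, noise_power: 2-D lists (frames x frequency bins) of
    non-negative power values for the clean speech and the noise.
    lc_db: local criterion in dB (0 dB is an assumed default).
    """
    eps = 1e-12  # avoids log(0) and division by zero
    mask = []
    for s_frame, n_frame in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_frame, n_frame):
            snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def apply_mask(mixture_power, mask):
    """Keep the time-frequency units selected by the mask, zero the rest."""
    return [
        [m * x for m, x in zip(mask_row, mix_row)]
        for mask_row, mix_row in zip(mask, mixture_power)
    ]


# Toy example: 2 frames x 2 frequency bins (hypothetical values).
speech = [[4.0, 0.1], [0.5, 2.0]]
noise = [[1.0, 1.0], [1.0, 0.5]]
mixture = [[5.0, 1.1], [1.5, 2.5]]

mask = ideal_binary_mask(speech, noise)
separated = apply_mask(mixture, mask)
```

The IBM requires access to the clean speech and the noise separately, so it can only serve as a training target; a mask estimation model such as the one described in the abstract has to predict it from the noisy audio and the visual features.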

Keywords

Deep Neural Network; Binary Mask Estimation; Speech Separation; Speech Enhancement

Domains

Signal and Image Processing [eess.SP]; Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]

Main file

AVMaskInterspeech18 (3).pdf (489.29 KB)

Origin: Files produced by the author(s)

Contributor: Ricard Marxer

https://hal.science/hal-01868604

Submitted on: Wednesday, September 5, 2018, 16:04:35

Last modified on: Friday, March 22, 2024, 18:24:03

Long-term archiving on: Thursday, December 6, 2018, 17:49:09

Dates and versions

hal-01868604 , version 1 (05-09-2018)

Identifiers

HAL Id : hal-01868604 , version 1
DOI : 10.21437/Interspeech.2018-2516

Cite

Mandar Gogate, Ahsan Adeel, Ricard Marxer, Jon Barker, Amir Hussain. DNN Driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation. Interspeech 2018, Sep 2018, Hyderabad, India. pp.2723-2727, ⟨10.21437/Interspeech.2018-2516⟩. ⟨hal-01868604⟩


Collections

UNIV-TLN CNRS UNIV-AMU LIS-LAB

