Visual voice activity detection as a help for speech source separation from convolutive mixtures

Bertrand Rivet; Laurent Girin; Christian Jutten

doi:10.1016/j.specom.2007.04.008

Journal article, Speech Communication, Year: 2007

Bertrand Rivet

Role: Corresponding author
PersonId: 1783
IdHAL: rivetb
ORCID: 0000-0003-4901-5302
IdRef: 113674422

GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face

Laurent Girin

Role: Author
PersonId: 3682
IdHAL: laurent-girin
ORCID: 0000-0002-9214-8760
IdRef: 088998037

GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face

Christian Jutten

Role: Author
PersonId: 4384
IdHAL: christianjutten
ORCID: 0000-0002-4477-4847
IdRef: 032689896

GIPSA - Signal Images Physique

Abstract

Audio–visual speech source separation consists of combining visual speech processing techniques (e.g., lip-parameter tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: visual information is used as a voice activity detector (VAD), which is combined with a new geometric separation method. The proposed audio–visual method is shown to efficiently extract a real spontaneous speech utterance in the difficult case of convolutive mixtures, even when the competing sources are highly non-stationary. Typical gains of 18–20 dB in signal-to-interference ratio (SIR) are obtained for a wide range of (2 × 2) and (3 × 3) mixtures. Moreover, the overall process is computationally simpler than previously proposed audio–visual separation schemes.
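The abstract only outlines how a VAD can drive a geometric extraction. The sketch below illustrates one way this can work in a single frequency bin, using the standard observation that, after a short-time Fourier transform, a convolutive mixture is approximately an instantaneous complex mixture in each bin (when the mixing filters are short relative to the analysis window). Assumptions not taken from the paper: the VAD mask is simply given (in the paper it is derived from visual information), the target is source 0, and the extraction vector is taken as the minor eigenvector of the mixture covariance over frames where the target is silent, which is (nearly) orthogonal to the subspace spanned by the interferers. The helper names `extraction_vector` and `sir_db` and all simulated data are hypothetical; this is not claimed to be the authors' exact algorithm.

```python
import numpy as np


def extraction_vector(X, target_silent):
    """Hypothetical helper: row vector b such that b @ x cancels interferers.

    X: (n_mics, n_frames) complex STFT coefficients of ONE frequency bin.
    target_silent: boolean mask (n_frames,), True where the VAD says the
    target speaker is silent (assumed given here).
    """
    Xs = X[:, target_silent]
    # Mixture covariance over frames where only the interferers are active
    R = Xs @ Xs.conj().T / Xs.shape[1]
    # eigh returns eigenvalues of a Hermitian matrix in ascending order,
    # so column 0 is the minor eigenvector, (nearly) orthogonal to the
    # subspace spanned by the interferers' mixing columns
    _, V = np.linalg.eigh(R)
    return V[:, 0].conj()


def sir_db(target_part, interf_part):
    """SIR in dB: target energy over interference energy."""
    return 10 * np.log10(np.sum(np.abs(target_part) ** 2)
                         / np.sum(np.abs(interf_part) ** 2))


# Simulated data for one frequency bin of a (3 x 3) mixture (hypothetical)
rng = np.random.default_rng(0)
n, T = 3, 2000
S = (rng.standard_normal((n, T)) + 1j * rng.standard_normal((n, T))) / np.sqrt(2)
silent = np.zeros(T, dtype=bool)
silent[: T // 2] = True           # target (source 0) silent in the first half
S[0, silent] = 0
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
noise = 1e-3 * (rng.standard_normal((n, T)) + 1j * rng.standard_normal((n, T)))
X = A @ S + noise                 # observed mixtures with weak sensor noise

b = extraction_vector(X, silent)

# Split the mixtures into target and interference images to measure SIR
target_img = A[:, :1] @ S[:1]     # (n, T) contribution of source 0
interf_img = A[:, 1:] @ S[1:]     # (n, T) contribution of sources 1..n-1
sir_in = sir_db(target_img[0], interf_img[0])      # at microphone 0
sir_out = sir_db(b @ target_img, b @ interf_img)   # after extraction
print(f"SIR in: {sir_in:.1f} dB, SIR out: {sir_out:.1f} dB")
```

Because only the target is extracted in every bin, and its identity is fixed by the VAD rather than by the separation itself, this construction does not need to match sources across frequencies; the output of each bin is still known only up to a complex scale factor (`b` is unit-norm), which a complete system would have to fix before resynthesis.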

Keywords

Physical Sciences; Convolutive mixtures; Highly non-stationary environment; Speech enhancement; Speech source separation; Visual speech processing

Domains

Signal and Image Processing [eess.SP]

Main file

PEER_stage2_10.1016%2Fj.specom.2007.04.008.pdf (1.86 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-00499184

Submitted on: Friday, 9 July 2010, 03:53:55

Last modified on: Thursday, 4 April 2024, 21:36:39

Long-term archiving on: Thursday, 1 December 2016, 05:02:03

Dates and versions

hal-00499184 , version 1 (09-07-2010)

Identifiers

HAL Id: hal-00499184, version 1
DOI: 10.1016/j.specom.2007.04.008

Cite

Bertrand Rivet, Laurent Girin, Christian Jutten. Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication, 2007, 49 (7-8), pp.667-677. ⟨10.1016/j.specom.2007.04.008⟩. ⟨hal-00499184⟩

Collections

UGA CNRS OSUG GIPSA GIPSA-DIS GIPSA-DPC GIPSA-MPACIF GIPSA-SIGMAPHY PEER POLYTECH-GRENOBLE
