Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

Bertrand Rivet; Laurent Girin; Christian Jutten

doi:10.1109/TASL.2006.872619

Journal article. IEEE Transactions on Audio, Speech and Language Processing. Year: 2007


Bertrand Rivet

Role: Author
PersonId: 1783
IdHAL: rivetb
ORCID: 0000-0003-4901-5302
IdRef: 113674422

GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face

Laurent Girin

Role: Author
PersonId: 3682
IdHAL: laurent-girin
ORCID: 0000-0002-9214-8760
IdRef: 088998037

GIPSA - Machines Parlantes, Agents Communicants & Interaction Face-à-face

Christian Jutten

Role: Author
PersonId: 4384
IdHAL: christianjutten
ORCID: 0000-0002-4477-4847
IdRef: 032689896

GIPSA - Signal Images Physique

Abstract

Looking at a speaker's face can help a listener hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques. The algorithm is applied to the difficult and realistic case of convolutive mixtures. It works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Separation is performed frequency by frequency by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model. This model is then used to resolve the standard source permutation and scale factor ambiguities encountered at each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
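The statement that a convolutive mixture becomes an additive mixture in each frequency channel can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes circular convolution and a plain DFT so that the identity X_i(f) = H_i1(f) S_1(f) + H_i2(f) S_2(f) holds exactly, whereas a practical system would use a short-time transform, for which the relation is only approximate. The function names, the filter taps, and the random test signals are all hypothetical and chosen for illustration.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def circular_convolve(h, s):
    """Circular convolution of a short filter h with a signal s."""
    n = len(s)
    return [sum(h[k] * s[(t - k) % n] for k in range(len(h))) for t in range(n)]


def mix(filters, sources):
    """2 x 2 convolutive mixture: x_i(t) = sum over j of (h_ij * s_j)(t)."""
    n = len(sources[0])
    mixtures = []
    for i in range(2):
        parts = [circular_convolve(filters[i][j], sources[j]) for j in range(2)]
        mixtures.append([parts[0][t] + parts[1][t] for t in range(n)])
    return mixtures


def max_bin_error(filters, sources):
    """Largest deviation from X_i(f) = sum over j of H_ij(f) S_j(f), over all bins."""
    n = len(sources[0])
    X = [dft(x) for x in mix(filters, sources)]
    S = [dft(s) for s in sources]
    # Zero-pad each filter to the signal length before transforming it.
    H = [[dft(list(filters[i][j]) + [0.0] * (n - len(filters[i][j])))
          for j in range(2)] for i in range(2)]
    return max(abs(X[i][f] - (H[i][0][f] * S[0][f] + H[i][1][f] * S[1][f]))
               for i in range(2) for f in range(n))


# Hypothetical test data: two random sources and four 3-tap mixing filters.
random.seed(0)
N = 32
sources = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(2)]
filters = [[[1.0, 0.5, 0.2], [0.6, -0.3, 0.1]],
           [[0.4, 0.2, -0.1], [1.0, -0.4, 0.3]]]

# The error should be at the level of floating-point round-off.
print(max_bin_error(filters, sources))
```

Because each frequency bin is then an independent 2 × 2 instantaneous mixture, a separation algorithm run bin by bin can return the sources in a different order and with a different scale in each bin. This is the permutation and scale ambiguity that the abstract says is resolved with the audiovisual model.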

Keywords

Audiovisual coherence; blind source separation; convolutive mixture; speech enhancement; statistical modeling

Domains

Signal and image processing [eess.SP]

Main file

Rivet-AVspeech-IEEE.pdf (743.9 KB)

Origin: Publisher files allowed on an open archive

Contributor: Christian Jutten

https://hal.science/hal-00174100

Submitted on: Friday, September 21, 2007, 15:07:49

Last modified on: Thursday, April 4, 2024, 21:10:41

Long-term archiving on: Thursday, April 8, 2010, 22:05:00

Dates and versions

hal-00174100, version 1 (21-09-2007)

Identifiers

HAL Id: hal-00174100, version 1
DOI: 10.1109/TASL.2006.872619

Cite

Bertrand Rivet, Laurent Girin, Christian Jutten. Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures. IEEE Transactions on Audio, Speech and Language Processing, 2007, 15 (1), pp.96-108. ⟨10.1109/TASL.2006.872619⟩. ⟨hal-00174100⟩


Collections

UGA CNRS OSUG GIPSA GIPSA-DIS GIPSA-DPC GIPSA-MPACIF GIPSA-SIGMAPHY POLYTECH-GRENOBLE

