Visual voice activity detection as a help for speech source separation from convolutive mixtures

Bertrand Rivet 1,*, Laurent Girin 1, Christian Jutten 2
* Corresponding author
1 GIPSA-MPACIF - MPACIF, GIPSA-DPC - Département Parole et Cognition
2 GIPSA-SIGMAPHY - SIGMAPHY, GIPSA-DIS - Département Images et Signal
Abstract: Audio–visual speech source separation consists in combining visual speech processing techniques (e.g., lip parameter tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: the visual information is used as a voice activity detector (VAD), which is combined with a new geometric separation method. The proposed audio–visual method is shown to efficiently extract a real spontaneous speech utterance in the difficult case of convolutive mixtures, even when the competing sources are highly non-stationary. Typical gains of 18–20 dB in signal-to-interference ratio are obtained for a wide range of (2 × 2) and (3 × 3) mixtures. Moreover, the overall process is computationally much simpler than previously proposed audio–visual separation schemes.
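The core idea in the abstract — using target-silence periods flagged by a (visual) VAD to characterize the interference and then cancel it geometrically — can be illustrated with a minimal sketch. This is not the paper's actual convolutive algorithm: it assumes an instantaneous 2 × 2 mixture, an oracle VAD, and a hypothetical mixing matrix, all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2x2 instantaneous mixture (the paper treats convolutive
# mixtures; an instantaneous mixture keeps this sketch short).
n = 4000
s1 = rng.standard_normal(n)          # target "speech"
s1[:1000] = 0.0                      # target silent at the start
s2 = rng.standard_normal(n)          # competing source, always active
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])           # hypothetical mixing matrix
X = A @ np.vstack([s1, s2])          # observed 2-channel mixture

# A visual VAD would flag the target-silence frames; here we simply
# assume it returned the first 1000 samples.
silence = X[:, :1000]

# During target silence the observations lie along the interferer's
# mixing column: estimate that direction as the principal eigenvector
# of the silence-period covariance.
cov = silence @ silence.T / silence.shape[1]
eigvals, eigvecs = np.linalg.eigh(cov)
a2_hat = eigvecs[:, -1]              # dominant direction (up to scale)

# Cancel the interferer by projecting onto the direction orthogonal
# to a2_hat (valid for a 2x2 mixture).
w = np.array([-a2_hat[1], a2_hat[0]])
y = w @ X                            # estimated target, up to scale

# Crude check: correlation of the output with each true source.
corr_target = abs(np.corrcoef(y[1000:], s1[1000:])[0, 1])
corr_interf = abs(np.corrcoef(y[1000:], s2[1000:])[0, 1])
print(f"correlation with target:     {corr_target:.3f}")
print(f"correlation with interferer: {corr_interf:.3f}")
```

The output correlates strongly with the target and negligibly with the interferer, showing how silence periods alone suffice to build a cancelling filter; the paper extends this geometric idea to convolutive mixtures with a visual, rather than oracle, VAD.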
https://hal.archives-ouvertes.fr/hal-00499184
Contributor: Hal Peer
Submitted on: Friday, July 9, 2010 - 3:53:55 AM
Last modification on: Monday, July 8, 2019 - 3:10:01 PM
Long-term archiving on: Thursday, December 1, 2016 - 5:02:03 AM

File

PEER_stage2_10.1016%2Fj.specom...
Files produced by the author(s)

Citation

Bertrand Rivet, Laurent Girin, Christian Jutten. Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication, Elsevier: North-Holland, 2007, 49 (7-8), pp. 667-677. ⟨10.1016/j.specom.2007.04.008⟩. ⟨hal-00499184⟩
