Multichannel Speech Enhancement Based on Time-frequency Masking Using Subband Long Short-Term Memory

Xiaofei Li 1 Radu Horaud 1
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, INPG - Institut National Polytechnique de Grenoble
Abstract : We propose a multichannel speech enhancement method using a long short-term memory (LSTM) recurrent neural network. The proposed method is developed in the short-time Fourier transform (STFT) domain. An LSTM network common to all frequency bands is trained; it processes each frequency band individually, mapping the multichannel noisy STFT coefficient sequence to the corresponding STFT magnitude ratio mask sequence of one reference channel. This subband LSTM network exploits the differences between the temporal/spatial characteristics of speech and noise: the speech source is non-stationary and spatially coherent, whereas noise is stationary and less spatially correlated. Experiments with different types of noise show that the proposed method outperforms both a baseline deep-learning-based full-band method and an unsupervised method. In addition, since it does not learn the wideband spectral structure of either speech or noise, the proposed subband LSTM network generalizes very well to unseen speakers and noise types.
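To make the data layout concrete, the following is a minimal NumPy sketch, not the authors' implementation: it builds simulated multichannel STFT coefficients, computes the reference-channel magnitude ratio mask that serves as the training target, and shows how each frequency band yields an independent input sequence for the shared subband LSTM. All dimensions and the mask clipping are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): F frequency bands,
# T frames, C microphone channels.
F, T, C = 4, 10, 2

# Simulated STFT coefficients: clean speech at the reference channel
# (channel 0) plus additive noise on all channels.
speech = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
noise = rng.standard_normal((F, T, C)) + 1j * rng.standard_normal((F, T, C))
noisy = noise.copy()
noisy[..., 0] += speech  # reference channel observes speech + noise

# Training target: STFT magnitude ratio mask of the reference channel,
# clipped to [0, 1] as is common for ratio-mask targets.
mask = np.clip(np.abs(speech) / (np.abs(noisy[..., 0]) + 1e-8), 0.0, 1.0)

def subband_sequences(noisy_stft):
    """Yield, per frequency band f, the length-T input sequence the
    shared LSTM would process: real and imaginary parts of the C
    channels concatenated into 2*C real features per frame."""
    n_bands, n_frames, n_chan = noisy_stft.shape
    for f in range(n_bands):
        feats = np.concatenate(
            [noisy_stft[f].real, noisy_stft[f].imag], axis=-1)
        yield f, feats  # feats has shape (n_frames, 2 * n_chan)

# One network, applied band by band with shared weights.
for f, seq in subband_sequences(noisy):
    assert seq.shape == (T, 2 * C)

# Enhancement step: apply the (here, oracle) mask to the
# reference-channel noisy STFT.
enhanced = mask * noisy[..., 0]
```

Because every band contributes a separate training sequence, the network never observes the wideband spectral envelope, which is consistent with the abstract's claim about generalization to unseen speakers and noise types.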

https://hal.inria.fr/hal-02264247
Contributor : Team Perception
Submitted on : Monday, October 14, 2019 - 5:55:01 PM
Last modification on : Thursday, November 28, 2019 - 10:40:17 AM

Identifiers

  • HAL Id : hal-02264247, version 2

Citation

Xiaofei Li, Radu Horaud. Multichannel Speech Enhancement Based on Time-frequency Masking Using Subband Long Short-Term Memory. WASPAA 2019 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 2019, New Paltz, NY, United States. pp.1-5. ⟨hal-02264247v2⟩
