Neural speech turn segmentation and affinity propagation for speaker diarization

Abstract : Speaker diarization is the task of determining "who speaks when" in an audio stream. Most diarization systems rely on statistical models to address four sub-tasks: speech activity detection (SAD), speaker change detection (SCD), speech turn clustering, and re-segmentation. First, following the recent success of recurrent neural networks (RNNs) for SAD and SCD, we propose to address re-segmentation with Long Short-Term Memory (LSTM) networks. Then, we propose to use affinity propagation on top of neural speaker embeddings for speech turn clustering, which outperforms standard Hierarchical Agglomerative Clustering (HAC). Finally, all these modules are combined and jointly optimized to form a speaker diarization pipeline in which all but the clustering step are based on RNNs. We provide experimental results on the French broadcast dataset ETAPE, where we reach state-of-the-art performance.
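
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `AffinityPropagation`, cosine similarity between embeddings, and randomly generated toy vectors standing in for real neural speaker embeddings. The helper name `cluster_speech_turns` and the `damping` value are hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cluster_speech_turns(embeddings, preference=None, damping=0.8):
    """Group speech-turn embeddings by speaker with affinity propagation.

    Hypothetical helper: one embedding (row) per speech turn is assumed.
    """
    # L2-normalize rows so that the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T

    # With preference=None, scikit-learn uses the median similarity, so the
    # number of clusters (speakers) is not fixed in advance.
    model = AffinityPropagation(
        affinity="precomputed",
        preference=preference,
        damping=damping,
        random_state=0,
    )
    return model.fit_predict(similarity)


# Toy data (assumption): two well-separated "speakers", five turns each.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
embeddings = np.vstack(
    [center + 0.05 * rng.standard_normal((5, 3)) for center in centers]
)
labels = cluster_speech_turns(embeddings)
print(labels)
```

Unlike agglomerative clustering, which needs a stopping threshold or a target number of clusters, affinity propagation selects exemplar turns and infers the number of clusters from the `preference` value, which is the main knob one would tune in a sketch like this.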

Cited literature [28 references]

https://hal.archives-ouvertes.fr/hal-01912236
Contributor : Limsi Publications
Submitted on : Thursday, November 8, 2018 - 3:56:47 PM
Last modification on : Friday, July 19, 2019 - 2:44:04 PM
Long-term archiving on : Saturday, February 9, 2019 - 2:24:56 PM

File

neural-speech-turn(2).pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01912236, version 1

Citation

Ruiqing Yin, Hervé Bredin, Claude Barras. Neural speech turn segmentation and affinity propagation for speaker diarization. Annual Conference of the International Speech Communication Association, Sep 2018, Hyderabad, India. ⟨hal-01912236⟩
