Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization

Abstract : Real-time speaker diarization has many potential applications, including public security, biometrics or forensics. It can also significantly speed up the indexing of increasingly large multimedia archives. In this paper, we address the issue of low-latency speaker diarization, which consists in continuously detecting new or reoccurring speakers within an audio stream and determining when each speaker is active with a low latency (e.g. every second). This is in contrast with most existing approaches to speaker diarization, which rely on multiple passes over the complete audio recording. The proposed approach combines speaker turn neural embeddings with an incremental structure prediction approach inspired by state-of-the-art Natural Language Processing models for Part-of-Speech tagging and dependency parsing. It can therefore leverage both information describing the utterance and the inherent temporal structure of interactions between speakers to learn, in a supervised framework, to identify speakers. Experiments on the Etape broadcast news benchmark validate the approach.
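
To make the low-latency setting concrete, here is a minimal sketch of incremental speaker assignment over a stream of segment embeddings. It is not the method of the paper: the learned structure prediction model is replaced by a plain nearest-centroid rule with a fixed cosine-distance threshold, and the names `IncrementalDiarizer`, `step` and `threshold`, as well as the toy 2-dimensional embeddings, are illustrative assumptions. It only shows the single-pass decision made for each incoming segment: reuse a known speaker or open a new one.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class IncrementalDiarizer:
    """Hypothetical single-pass labeller (assumption, not the paper's model).

    Each incoming embedding is assigned to the closest known speaker,
    or to a new speaker when no centroid is within `threshold`.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed value, would need tuning
        self.centroids = []         # running mean embedding per speaker
        self.counts = []            # number of segments seen per speaker

    def step(self, embedding):
        # Find the closest known speaker, if any.
        best, best_dist = None, float("inf")
        for k, centroid in enumerate(self.centroids):
            dist = cosine_distance(embedding, centroid)
            if dist < best_dist:
                best, best_dist = k, dist

        # No speaker is close enough: register a new one.
        if best is None or best_dist > self.threshold:
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Reoccurring speaker: update its running mean.
        n = self.counts[best]
        self.centroids[best] = [
            (c * n + x) / (n + 1)
            for c, x in zip(self.centroids[best], embedding)
        ]
        self.counts[best] = n + 1
        return best


# Toy stream: one made-up embedding per (e.g. one-second) segment.
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [0.1, 0.9]]
diarizer = IncrementalDiarizer(threshold=0.5)
labels = [diarizer.step(e) for e in stream]
print(labels)
```

Each decision uses only the segments seen so far, which is what distinguishes this setting from multi-pass clustering over the complete recording. The approach described in the abstract additionally learns, in a supervised way, from the temporal structure of speaker interactions, which this sketch ignores.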
Document type :
Conference papers
Cited literature [28 references]

https://hal.archives-ouvertes.fr/hal-01690162
Contributor : Claude Barras
Submitted on : Tuesday, January 23, 2018 - 5:06:01 PM
Last modification on : Saturday, May 4, 2019 - 1:20:41 AM
Long-term archiving on : Thursday, May 24, 2018 - 8:55:24 AM

File

1067.PDF
Publisher files allowed on an open archive

Citation

Guillaume Wisniewski, Hervé Bredin, Grégory Gelly, Claude Barras. Combining Speaker Turn Embedding and Incremental Structure Prediction for Low-Latency Speaker Diarization. Interspeech 2017, Aug 2017, Stockholm, Sweden. ⟨10.21437/Interspeech.2017-1067⟩. ⟨hal-01690162⟩
