Optimization of RNN-based Speech Activity Detection - HAL Open Archive
Journal Article: IEEE/ACM Transactions on Audio, Speech and Language Processing, Year: 2018

Optimization of RNN-based Speech Activity Detection

Abstract

Speech activity detection (SAD) is an essential component of automatic speech recognition (ASR) systems that directly impacts overall system performance. This paper investigates an optimization process for recurrent neural network (RNN) based SAD. This process optimizes all system parameters, including those used for feature extraction, the neural network weights, and the back-end parameters. Three cost functions are considered for SAD optimization: the frame error rate (FER), the NIST detection cost function (DCF), and the word error rate (WER) of a downstream speech recognizer. Different types of RNN models and optimization methods are investigated. Three types of RNNs are compared: a basic RNN, an LSTM network with peepholes, and a coordinated-gate LSTM (CG-LSTM) network introduced in [1]. Well suited to non-differentiable optimization problems, quantum-behaved particle swarm optimization (QPSO) is used to optimize feature extraction and posterior smoothing, as well as for the initial training of the neural networks. Experimental SAD results are reported on the NIST 2015 SAD evaluation data as well as the REPERE and AMI meeting corpora. Speech recognition results are reported on the OpenKWS'13 test data. For all tasks and conditions, the proposed optimization method significantly improves SAD performance, and among all the tested SAD methods the CG-LSTM model gives the best results.
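To illustrate the kind of optimizer the abstract refers to, the following is a minimal, generic sketch of quantum-behaved particle swarm optimization (QPSO, Sun et al., 2004), not the authors' actual implementation. It minimizes a scalar cost function without gradients, which is what makes it applicable to non-differentiable objectives such as FER or DCF; the function names, hyperparameter values, and the toy sphere objective are illustrative assumptions.

```python
import numpy as np

def qpso(cost, dim, n_particles=30, n_iters=200, beta=0.75,
         bounds=(-5.0, 5.0), seed=0):
    """Minimal QPSO sketch (hypothetical parameters, not the paper's setup).

    cost: callable mapping a (dim,) vector to a scalar to minimize.
    beta: contraction-expansion coefficient controlling the search spread.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))   # particle positions
    pbest = x.copy()                                   # personal best positions
    pbest_cost = np.array([cost(p) for p in pbest])
    gbest = pbest[np.argmin(pbest_cost)].copy()        # global best position

    for _ in range(n_iters):
        mbest = pbest.mean(axis=0)                     # mean of personal bests
        phi = rng.uniform(size=(n_particles, dim))
        attractor = phi * pbest + (1.0 - phi) * gbest  # local attractor point
        u = rng.uniform(size=(n_particles, dim))
        sign = np.where(rng.uniform(size=(n_particles, dim)) < 0.5, -1.0, 1.0)
        # Quantum-behaved position update: sample around the attractor
        x = attractor + sign * beta * np.abs(mbest - x) * np.log(1.0 / u)
        x = np.clip(x, lo, hi)

        c = np.array([cost(p) for p in x])
        improved = c < pbest_cost
        pbest[improved] = x[improved]
        pbest_cost[improved] = c[improved]
        gbest = pbest[np.argmin(pbest_cost)].copy()

    return gbest, float(pbest_cost.min())

# Toy usage: minimize the sphere function; in the paper the cost would
# instead be a SAD metric (FER, DCF) or downstream WER.
best, best_cost = qpso(lambda v: float(np.sum(v ** 2)), dim=5)
```

Because only cost evaluations are needed, the same loop can wrap any black-box objective, including one that runs a full SAD pipeline and returns its error rate.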
No file deposited

Dates and versions

hal-02404747 , version 1 (11-12-2019)

Identifiers

  • HAL Id : hal-02404747 , version 1

Cite

Gregory Gelly, Jean-Luc Gauvain. Optimization of RNN-based Speech Activity Detection. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2018, 26, pp. 646-656. ⟨hal-02404747⟩