End-to-End Speech Recognition From the Raw Waveform

Abstract: State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches, and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions.
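As a rough illustration of the first modification, the sketch below applies instance normalization in its usual form: each channel of a single utterance is rescaled to zero mean and unit variance over time. The abstract does not specify where the layer sits or whether it has learnable scale and shift parameters, so the function name `instance_norm`, the `eps` value, and the channels-by-frames layout are assumptions made for illustration only, not the authors' implementation.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel of one utterance to zero mean, unit variance.

    features: list of channels, each a list of per-frame values
              (channels x frames) for a single utterance.
    eps:      small constant for numerical stability (assumed value).
    """
    normalized = []
    for channel in features:
        # Statistics are computed per utterance and per channel,
        # never across a batch or across channels.
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([(x - mean) * scale for x in channel])
    return normalized


# Hypothetical filterbank output: 2 channels, 4 frames.
example = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
normalized_example = instance_norm(example)
```

Because the statistics come from the utterance itself, the output does not depend on the overall gain of the input, which is one plausible reason such a layer helps filterbanks learned from the raw waveform.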
Document type:
Conference paper
Interspeech 2018, Sep 2018, Hyderabad, India. 〈10.21437/Interspeech.2018-2414〉

https://hal.archives-ouvertes.fr/hal-01888739
Contributor: Emmanuel Dupoux
Submitted on: Friday, December 7, 2018 - 14:40:40
Last modified on: Friday, December 7, 2018 - 17:51:56

File

Zeghidour_USCD_2018_End2end_fr...
Files produced by the author(s)

Citation

Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, Emmanuel Dupoux. End-to-End Speech Recognition From the Raw Waveform. Interspeech 2018, Sep 2018, Hyderabad, India. 〈10.21437/Interspeech.2018-2414〉. 〈hal-01888739〉
