Conference papers

E2E-SINCNET: TOWARD FULLY END-TO-END SPEECH RECOGNITION

Abstract: Modern end-to-end (E2E) Automatic Speech Recognition (ASR) systems rely on Deep Neural Networks (DNN) that are mostly trained on handcrafted, pre-computed acoustic features such as Mel-filter-banks or Mel-frequency cepstral coefficients. Nonetheless, despite lower performance, E2E ASR models processing raw waveforms remain an active research field due to the lossless nature of the input signal. In this paper, we propose E2E-SincNet, a novel fully E2E ASR model that goes from the raw waveform to the text transcript by merging two recent and powerful paradigms: SincNet and the joint CTC-attention training scheme. Experiments on two different speech recognition tasks show that our approach outperforms previously investigated E2E systems relying either on the raw waveform or on pre-computed acoustic features, with a reported top-of-the-line Word Error Rate (WER) of 4.7% on the Wall Street Journal (WSJ) dataset.
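As background on the SincNet front-end named in the abstract: its first convolutional layer constrains each kernel to a parameterized band-pass filter, built as the difference of two windowed sinc low-pass filters, so only the cutoff frequencies are learned rather than every tap. The following is a minimal NumPy sketch of that kernel construction under our own assumptions (function and parameter names such as `sinc_bandpass_filters`, `low_hz`, and `band_hz` are illustrative, not from the paper, and the learnable-parameter machinery is omitted):

```python
import numpy as np

def sinc_bandpass_filters(low_hz, band_hz, kernel_size, sample_rate=16000):
    """Build band-pass FIR kernels in the SincNet style: each kernel is the
    difference of two Hamming-windowed sinc low-pass filters, passing the
    band [f1, f1 + bandwidth]. In the actual model these cutoffs are
    trainable; here they are fixed inputs for illustration."""
    # Symmetric time axis in seconds, centered on the kernel midpoint.
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
    window = np.hamming(kernel_size)
    filters = []
    for f1, bw in zip(low_hz, band_hz):
        f2 = f1 + bw
        # Ideal low-pass impulse responses with cutoffs f1 and f2
        # (np.sinc is the normalized sinc: sin(pi x) / (pi x)).
        lp1 = 2.0 * f1 * np.sinc(2.0 * f1 * t)
        lp2 = 2.0 * f2 * np.sinc(2.0 * f2 * t)
        # Their difference is a band-pass filter; the window tames ripple.
        filters.append((lp2 - lp1) * window)
    return np.stack(filters)

# Two example filters applied to one second of dummy raw waveform.
filters = sinc_bandpass_filters(low_hz=[50.0, 300.0],
                                band_hz=[100.0, 200.0],
                                kernel_size=251)
waveform = np.random.randn(16000)
features = np.stack([np.convolve(waveform, f, mode="valid")
                     for f in filters])
```

In a full model, the resulting feature maps would feed the joint CTC-attention encoder-decoder; here the convolution is done with `np.convolve` purely to show the shape of the computation.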


https://hal.archives-ouvertes.fr/hal-02484600
Contributor: Titouan Parcollet
Submitted on : Wednesday, February 19, 2020 - 3:00:30 PM
Last modification on : Wednesday, February 26, 2020 - 1:44:47 AM

File

ICASSP_2020___E2E_SINCNET-5.pd...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02484600, version 1

Citation

Titouan Parcollet, Mohamed Morchid, Georges Linares. E2E-SINCNET: TOWARD FULLY END-TO-END SPEECH RECOGNITION. ICASSP, May 2020, Barcelona, Spain. ⟨hal-02484600⟩
