D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, p.11, 2011.

M. Ravanelli, T. Parcollet, and Y. Bengio, The PyTorch-Kaldi speech recognition toolkit, Proc. of ICASSP, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02107617

A. Graves and N. Jaitly, Towards end-to-end speech recognition with recurrent neural networks, International Conference on Machine Learning, pp.1764-1772, 2014.

Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent et al., Towards end-to-end speech recognition with deep convolutional neural networks, 2017.

S. Kim, T. Hori, and S. Watanabe, Joint CTC-attention based end-to-end speech recognition using multi-task learning, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4835-4839, 2017.

Y. Wang, T. Chen, H. Xu, S. Ding, H. Lv et al., Espresso: A fast end-to-end neural speech recognition toolkit, 2019.

N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, End-to-end speech recognition from the raw waveform, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01888739

J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev et al., Jasper: An end-to-end convolutional neural acoustic model, 2019.

D. Bahdanau, J. Chorowski, and D. Serdyuk, End-to-end attention-based large vocabulary speech recognition, 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.4945-4949, 2016.

T. Hori, S. Watanabe, Y. Zhang, and W. Chan, Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM, 2017.

L. Dong, S. Xu, and B. Xu, Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5884-5888, 2018.

J. Salazar, K. Kirchhoff, and Z. Huang, Self-attention networks for connectionist temporal classification in speech recognition, ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.7115-7119, 2019.

D. Palaz, R. Collobert, and M. Doss, End-to-end phoneme sequence recognition using convolutional neural networks, 2013.

Z. Tüske, P. Golik, R. Schlüter, and H. Ney, Acoustic modeling with deep neural networks using raw time signal for LVCSR, Fifteenth annual conference of the international, 2014.

Y. Hoshen, R. J. Weiss, and K. W. Wilson, Speech acoustic modeling from raw multichannel waveforms, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4624-4628, 2015.

M. Ravanelli and Y. Bengio, Speaker recognition from raw waveform with SincNet, 2018.

M. Ravanelli and Y. Bengio, Speech and speaker recognition from raw waveform with SincNet, 2018.

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba et al., ESPnet: End-to-end speech processing toolkit, 2018.

L. R. Rabiner and R. W. Schafer, Theory and applications of digital speech processing, vol.64, 2011.

S. K. Mitra and Y. Kuo, Digital signal processing: a computer-based approach, vol.2, 2006.

E. Loweimi, P. Bell, and S. Renals, On learning interpretable CNNs with parametric modulated kernel-based filters, Proc. Interspeech, pp.3480-3484, 2019.

M. Ravanelli and Y. Bengio, Interpretable convolutional filters with SincNet, 2018.

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, Proceedings of the 23rd international conference on Machine learning, pp.369-376, 2006.

M. Luong, H. Pham, and C. Manning, Effective approaches to attention-based neural machine translation, 2015.

J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-based models for speech recognition, Advances in neural information processing systems, pp.577-585, 2015.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus et al., DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1, NASA STI/Recon technical report n, vol.93, 1993.

M. D. Zeiler, ADADELTA: an adaptive learning rate method, 2012.

A. Tjandra, S. Sakti, and S. Nakamura, Attention-based wav2text with feature transfer learning, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.309-315, 2017.

S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix et al., Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration, 2019.