A. Bérard, O. Pietquin, C. Servan, and L. Besacier, Listen and translate: A proof of concept for end-to-end speech-to-text translation, NIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.

R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, Sequence-to-sequence models can directly translate foreign speech, Interspeech, 2017.

A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, End-to-end automatic speech translation of audiobooks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater, Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, CoRR, 2018.

Y. Chung, W. Weng, S. Tong, and J. Glass, Towards unsupervised speech-to-text translation, CoRR, 2018.

Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson et al., Direct speech-to-speech translation with a sequence-to-sequence model, CoRR, 2019.

Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao et al., Leveraging weakly supervised data to improve end-to-end speech-to-text translation, CoRR, 2018.

M. Sperber, G. Neubig, J. Niehues, and A. Waibel, Attention-passing models for robust and data-efficient end-to-end speech translation, CoRR, 2019.

R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault et al., How2: a large-scale dataset for multimodal language understanding, 2018.

M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, MuST-C: a multilingual speech translation corpus, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.2012-2017, 2019.

S. Meignier and T. Merlin, LIUM SpkDiarization: An open source toolkit for diarization, CMU SPUD Workshop, 2010.
URL: https://hal.archives-ouvertes.fr/hal-01433518

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.

F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation, International Conference on Speech and Computer, pp.198-208, 2018.

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba et al., ESPnet: End-to-end speech processing toolkit, Interspeech, 2018.

P. Ghahremani, B. Babaali, D. Povey, K. Riedhammer, J. Trmal et al., A pitch extraction algorithm tuned for automatic speech recognition, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.2494-2498, 2014.

T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, Audio augmentation for speech recognition, Sixteenth Annual Conference of the International Speech Communication Association, 2015.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR, 2015.

D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann et al., The Kaldi speech recognition toolkit, IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.

A. Stolcke, SRILM - an extensible language modeling toolkit, Proceedings of the 7th International Conference on Spoken Language Processing, pp.901-904, 2002.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in Neural Information Processing Systems, vol.30, pp.5998-6008, 2017.

M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross et al., fairseq: A fast, extensible toolkit for sequence modeling, Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

R. Sennrich, B. Haddow, and A. Birch, Neural machine translation of rare words with subword units, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1715-1725, 2016.