T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, A simple framework for contrastive learning of visual representations, 2020.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR, 2018.

A. Baevski, M. Auli, and A. Mohamed, Effectiveness of self-supervised pre-training for speech recognition, 2019.

K. Kawakami, L. Wang, C. Dyer, P. Blunsom, and A. van den Oord, Learning robust and multilingual speech representations, 2020.

Y. Chung and J. Glass, Generative pre-training for speech with autoregressive predictive coding, 2019.

Y. Chung, W. Hsu, H. Tang, and J. R. Glass, An unsupervised autoregressive model for speech representation learning, CoRR, 2019.

Y. Chung and J. Glass, Improved speech representations with multi-target autoregressive predictive coding, 2020.

S. Schneider, A. Baevski, R. Collobert, and M. Auli, wav2vec: Unsupervised Pre-Training for Speech Recognition, Proc. of Interspeech, pp.3465-3469, 2019.

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu et al., Libri-light: A benchmark for ASR with limited or no supervision, 2019.

M. Rivière, A. Joulin, P. Mazaré, and E. Dupoux, Unsupervised pretraining transfers well across languages, 2020.

M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro et al., Multi-task self-supervised learning for robust speech recognition, 2020.

J. Engel, L. Hantrakul, C. Gu, and A. Roberts, DDSP: Differentiable digital signal processing, 2020.

S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, Learning problem-agnostic speech representations from multiple self-supervised tasks, 2019.

A. Bérard, O. Pietquin, C. Servan, and L. Besacier, Listen and translate: A proof of concept for end-to-end speech-to-text translation, NIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.

R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, Sequence-to-sequence models can directly transcribe foreign speech, Proc. of Interspeech, 2017.

A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, End-to-end automatic speech translation of audiobooks, CoRR, 2018.

S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater, Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, CoRR, 2018.

Y. Chung, W. Weng, S. Tong, and J. Glass, Towards unsupervised speech-to-text translation, CoRR, 2018.

Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson et al., Direct speech-to-speech translation with a sequence-to-sequence model, CoRR, 2019.

M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, MuST-C: a Multilingual Speech Translation Corpus, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.2012-2017, 2019.

Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao et al., Leveraging weakly supervised data to improve end-to-end speech-to-text translation, CoRR, 2018.

M. Sperber, G. Neubig, J. Niehues, and A. Waibel, Attention-passing models for robust and data-efficient end-to-end speech translation, CoRR, 2019.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, LibriSpeech: an ASR corpus based on public domain audio books, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5206-5210, 2015.

R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault et al., How2: a large-scale dataset for multimodal language understanding, ViGIL Workshop, NeurIPS, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02431947

H. Nguyen, N. Tomashenko, M. Z. Boito, A. Caubrière, F. Bougares et al., ON-TRAC consortium end-to-end speech translation systems for the IWSLT 2019 shared task, Proc. of IWSLT, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02352949

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. of ICLR, 2015.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, Proc. of ICLR, 2015.

H. Inaguma, S. Kiyono, K. Duh, S. Karita, N. E. Soplin et al., ESPnet-ST: All-in-one speech translation toolkit, 2020.

J. Niehues, R. Cattoni, S. Stüker, M. Negri, M. Turchi et al., The IWSLT 2019 evaluation campaign, Proceedings of the 16th International Workshop on Spoken Language Translation, 2019.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett et al., DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, 1993.

N. Luong, L. Besacier, and B. Lecouteux, Towards accurate predictors of word quality for machine translation: Lessons learned on French-English and English-Spanish systems, Data and Knowledge Engineering, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01147902

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, Tech. Rep., 2011.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5329-5333, 2018.

A. Nagrani, J. S. Chung, and A. Zisserman, VoxCeleb: a large-scale speaker identification dataset, Proc. of Interspeech, pp.2616-2620, 2017.

N. Tomashenko, B. M. Srivastava, X. Wang, E. Vincent, A. Nautsch et al., Introducing the VoicePrivacy initiative, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02562199