E. Ansari, A. Axelrod, N. Bach, O. Bojar, R. Cattoni et al., Findings of the IWSLT 2020 Evaluation Campaign, Proc. of IWSLT, 2020.

N. Arivazhagan, C. Cherry, W. Macherey, C. Chiu, S. Yavuz et al., Monotonic infinite lookback attention for simultaneous machine translation, Proc. of ACL, 2019.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, Proc. of ICLR, 2015.

A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, End-to-End Automatic Speech Translation of Audiobooks, Proc. of ICASSP, 2018.

A. Bérard, O. Pietquin, C. Servan, and L. Besacier, Listen and translate: A proof of concept for end-to-end speech-to-text translation, NIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.

F. Dalvi, N. Durrani, H. Sajjad, and S. Vogel, Incremental decoding and training methods for simultaneous translation in neural machine translation, Proc. of NAACL-HLT, 2018.

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, 2011.

M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, MuST-C: a multilingual speech translation corpus, Proc. of NAACL-HLT, 2019.

P. Ghahremani, B. Babaali, D. Povey, K. Riedhammer, J. Trmal et al., A pitch extraction algorithm tuned for automatic speech recognition, Proc. of ICASSP, 2014.

F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation, International Conference on Speech and Computer, 2018.

J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, N. Roselló, A. Giménez et al., Europarl-ST: A multilingual corpus for speech translation of parliamentary debates, Proc. of ICASSP, 2020.

Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao et al., Leveraging weakly supervised data to improve end-to-end speech-to-text translation, Proc. of ICASSP, 2019.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proc. of ICLR, 2015.

T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, Audio augmentation for speech recognition, Proc. of INTERSPEECH, 2015.

P. Koehn, Europarl: A parallel corpus for statistical machine translation, Proc. of MT Summit, 2005.

T. Kudo and J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, Proc. of EMNLP: System Demonstrations, 2018.

M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu et al., STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework, Proc. of ACL, 2019.

X. Ma, J. Pino, J. Cross, L. Puzon, and J. Gu, Monotonic multihead attention, Proc. of ICLR, 2020.

H. Nguyen, N. Tomashenko, M. Zanon Boito, A. Caubrière, F. Bougares et al., ON-TRAC consortium end-to-end speech translation systems for the IWSLT 2019 shared task, Proc. of IWSLT, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02352949

J. Niehues, N. Pham, T. Ha, M. Sperber, and A. Waibel, Low-latency neural speech translation, Proc. of INTERSPEECH, 2018.

M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross et al., fairseq: A fast, extensible toolkit for sequence modeling, Proc. of NAACL-HLT: Demonstrations, 2019.

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: a method for automatic evaluation of machine translation, Proc. of ACL, 2002.

D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph et al., SpecAugment: A simple data augmentation method for automatic speech recognition, Proc. of INTERSPEECH, 2019.

D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu et al., Semi-orthogonal low-rank matrix factorization for deep neural networks, Proc. of INTERSPEECH, 2018.

D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann et al., The Kaldi speech recognition toolkit, IEEE ASRU Workshop, 2011.

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar et al., Purely sequence-trained neural networks for ASR based on lattice-free MMI, Proc. of INTERSPEECH, 2016.

R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault et al., How2: a large-scale dataset for multimodal language understanding, ViGIL Workshop, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02431947

R. Sennrich, B. Haddow, and A. Birch, Neural machine translation of rare words with subword units, Proc. of ACL, 2016.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. of ICLR, 2015.

J. Smith, H. Saint-Amand, M. Plamada, P. Koehn, C. Callison-Burch et al., Dirt cheap web-scale parallel text from the common crawl, Proc. of ACL, 2013.

M. Sperber, G. Neubig, J. Niehues, and A. Waibel, Attention-passing models for robust and data-efficient end-to-end speech translation, Transactions of the ACL, 2019.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, Proc. of CVPR, 2016.

J. Tiedemann, Parallel data, tools and interfaces in OPUS, Proc. of LREC, 2012.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Proc. of NeurIPS, 2017.

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba et al., ESPnet: End-to-end speech processing toolkit, Proc. of INTERSPEECH, 2018.

R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, Sequence-to-sequence models can directly transcribe foreign speech, Proc. of INTERSPEECH, 2017.