O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, Massively multilingual adversarial speech recognition, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.96-108, 2019.

R. Aharoni, M. Johnson, and O. Firat, Massively multilingual neural machine translation, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.3874-3884, 2019.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015, 2015.

S. Bansal, H. Kamper, A. Lopez, and S. Goldwater, Towards speech-to-text translation without speech recognition, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol.2, pp.474-479, 2017.

S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater, Pre-training on high-resource speech recognition improves low-resource speech-to-text translation, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, vol.1, pp.58-68, 2019.

A. Bérard, O. Pietquin, C. Servan, and L. Besacier, Listen and translate: A proof of concept for end-to-end speech-to-text translation, NIPS Workshop on End-to-end Learning for Speech and Audio Processing, 2016.

A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, End-to-end automatic speech translation of audiobooks, ICASSP 2019 -2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5971-5975, 2018.

C. Christodoulopoulos and M. Steedman, A massively parallel corpus: the Bible in 100 languages, Language Resources and Evaluation, vol.49, pp.375-395, 2015.

G. Chrupała, L. Gelderloos, and A. Alishahi, Representations of language in a model of visually grounded speech signal, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.613-622, 2017.

Y. Chung, W. Weng, S. Tong, and J. R. Glass, Towards unsupervised speech-to-text translation, ICASSP 2019 -2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.7170-7174, 2018.

M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, MuST-C: a multilingual speech translation corpus, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.

C. Federmann and W. D. Lewis, Microsoft speech language translation (MSLT) corpus: The IWSLT 2016 release for English, French and German, International Workshop on Spoken Language Translation, 2016.

D. Harwath, G. Chuang, and J. R. Glass, Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4969-4973, 2018.

D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba et al., Jointly discovering visual objects and spoken words from raw sensory input, Lecture Notes in Computer Science, vol.11210, issue.6, pp.659-677, 2018.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, N. Roselló, A. Giménez et al., Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.

Y. Jia, M. Johnson, W. Macherey, R. Weiss, Y. Cao et al., Leveraging weakly supervised data to improve end-to-end speech-to-text translation, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p.5, 2019.

Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson et al., Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model, Proc. Interspeech, pp.1123-1127, 2019.

T. Kisler, U. Reichel, and F. Schiel, Multilingual processing of speech via web services, Computer Speech & Language, vol.45, pp.326-347, 2017.

A. C. Kocabiyikoglu, L. Besacier, and O. Kraif, Augmenting LibriSpeech with French translations: A multimodal corpus for direct speech translation evaluation, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01709568

L. Lee, J. Glass, H. Lee, and C. Chan, Spoken content retrieval: beyond cascading speech recognition with text retrieval, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue.9, pp.1389-1420, 2015.

P. Littell, D. R. Mortensen, K. Lin, K. Kairis, C. Turner et al., URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol.2, pp.8-14, 2017.

R. Navigli and S. P. Ponzetto, BabelNet: Building a very large multilingual semantic network, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp.216-225, 2010.

J. Nivre, M. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič et al., Universal Dependencies v1: A multilingual treebank collection, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 2016.

M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch et al., Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus, International Workshop on Spoken Language Translation, 2013.

R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault et al., How2: a large-scale dataset for multimodal language understanding, Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02431947

T. Schultz and T. Schlippe, Globalphone: Pronunciation dictionaries in 20 languages, Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, pp.337-341, 2014.

H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán, Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from Wikipedia, 2019.

G. Sérasset, DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF, Semantic Web, vol.6, issue.4, pp.355-361, 2015.

M. Sperber, G. Neubig, J. Niehues, and A. Waibel, Attention-passing models for robust and data-efficient end-to-end speech translation, Transactions of the Association for Computational Linguistics, vol.7, pp.313-325, 2019.

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss et al., Tacotron: Towards end-to-end speech synthesis, 18th Annual Conference of the International Speech Communication Association, pp.4006-4010, 2017.

R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, Sequence-to-sequence models can directly translate foreign speech, 18th Annual Conference of the International Speech Communication Association, pp.2625-2629, 2017.