L. W. Barsalou, W. K. Simmons, A. K. Barbey, and C. D. Wilson, Grounding conceptual knowledge in modality-specific systems, Trends in Cognitive Sciences, 2003.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, Proceedings of the International Conference on Computer Vision (ICCV), 2015.

L. Specia, S. Frank, K. Sima'an, and D. Elliott, A shared task on multimodal machine translation and crosslingual image description (WMT), Proceedings of the First Conference on Machine Translation (WMT). ACL, 2016.

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav et al., Visual Dialog, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

S. Palaskar, R. Sanabria, and F. Metze, End-to-end multimodal speech recognition, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2018.

X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta et al., Microsoft COCO captions: Data collection and evaluation server, Computing Research Repository, 2015.

J. J. Godfrey, E. C. Holliman, and J. McDaniel, Switchboard: Telephone speech corpus for research and development, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1992.

Y. Miao and F. Metze, Open-domain audio-visual speech recognition: A deep learning approach, Proceedings of Interspeech. ISCA, 2016.

A. Gupta, Y. Miao, L. Neves, and F. Metze, Visual features for context-aware speech recognition, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2017.

O. Caglayan, M. García-Martínez, A. Bardet, W. Aransa, F. Bougares et al., NMTPY: A flexible toolkit for advanced neural machine translation systems, The Prague Bulletin of Mathematical Linguistics, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01647873

S.-I. Yu, L. Jiang, and A. Hauptmann, Instructional videos for unsupervised harvesting and learning of action examples, Proceedings of the ACM International Conference on Multimedia (ACM MM), 2014.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 2003.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997.

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, Z. Chen et al., Sequence-to-sequence models can directly translate foreign speech, Proceedings of Interspeech. ISCA, 2017.

A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, End-to-end automatic speech translation of audiobooks, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2018.

J. Libovický and J. Helcl, Attention strategies for multi-source sequence-to-sequence learning, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 2017.

J. Libovický, S. Palaskar, S. Gella, and F. Metze, Multimodal abstractive summarization of open-domain videos, Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NIPS, 2018.

M. Hodosh, P. Young, and J. Hockenmaier, Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, Journal of Artificial Intelligence Research, 2013.

M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. Ikizler Cinbis, and R. Cakici, TasvirEt: Görüntülerden otomatik Türkçe açıklama oluşturma için bir denektaşı veri kümesi (TasvirEt: A benchmark dataset for automatic Turkish description generation from images), Proceedings of the Sinyal İşleme ve İletişim Uygulamaları Kurultayı (SIU), 2016.

X. Li, W. Lan, J. Dong, and H. Liu, Adding Chinese captions to images, Proceedings of the International Conference on Multimedia Retrieval (ICMR), 2016.

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo et al., Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, International Journal of Computer Vision, 2017.

D. Elliott, S. Frank, K. Sima'an, and L. Specia, Multi30K: Multilingual English-German image descriptions, Proceedings of the Workshop on Vision and Language. ACL, 2016.

Y. Yoshikawa, Y. Shigeto, and A. Takeuchi, STAIR captions: Constructing a large-scale Japanese image caption dataset, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 2017.

D. Elliott, S. Frank, L. Barrault, F. Bougares, and L. Specia, Findings of the second shared task on multimodal machine translation and multilingual image description, Proceedings of the Second Conference on Machine Translation (WMT). ACL, 2017.

L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott et al., Findings of the third shared task on multimodal machine translation, Proceedings of the Third Conference on Machine Translation (WMT). ACL, 2018.

D. L. Chen and W. B. Dolan, Building a persistent workforce on Mechanical Turk for multilingual data collection, Proceedings of The 3rd Human Computation Workshop (HCOMP). AAAI, 2011.

A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal et al., Movie description, International Journal of Computer Vision, 2017.

J. Xu, T. Mei, T. Yao, and Y. Rui, MSR-VTT: A large video description dataset for bridging video and language, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

M. Cooke, J. Barker, S. Cunningham, and X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, The Journal of the Acoustical Society of America, 2006.

C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin et al., Audio visual speech recognition, IDIAP, 2000.

J. S. Chung and A. Zisserman, Lip reading in the wild, Proceedings of the Asian Conference on Computer Vision (ACCV), 2016.

M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch et al., Improved Speech-to-Text translation with the Fisher and Callhome Spanish-English speech translation corpus, Proceedings of the International Workshop on Spoken Language Translation (IWSLT). ACL, 2013.

A. C. Kocabiyikoglu, L. Besacier, and O. Kraif, Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation, Proceedings of the International Conference on Language Resources and Evaluation (LREC). ELRA, 2018.

K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay et al., Teaching machines to read and comprehend, Proceedings of the International Conference on Neural Information Processing Systems (NIPS). NIPS, 2015.

R. Nallapati, B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang, Abstractive text summarization using sequence-to-sequence RNNs and beyond, Proceedings of the Conference on Computational Natural Language Learning (CoNLL). ACL, 2016.

P. Over, H. Dang, and D. Harman, DUC in context, Information Processing & Management, 2007.

H. Li, J. Zhu, C. Ma, J. Zhang, and C. Zong, Multi-modal summarization for asynchronous collection of text, image, audio and video, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2017.

R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem et al., Automatic description generation from images: A survey of models, datasets, and evaluation measures, Journal of Artificial Intelligence Research, 2016.

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, 2014.

M. Grubinger, P. D. Clough, H. Müller, and T. Deselaers, The IAPR TC-12 benchmark: A new evaluation resource for visual information systems, Proceedings of the International Conference on Language Resources and Evaluation (LREC). ELRA, 2006.

N. UzZaman, J. P. Bigham, and J. F. Allen, Multimodal summarization of complex sentences, Proceedings of the International Conference on Intelligent User Interfaces (IUI), 2011.

C. Napoles, M. Gormley, and B. Van Durme, Annotated Gigaword, Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. ACL, 2012.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.

K. Hara, H. Kataoka, and Y. Satoh, Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

T. Kudo, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2018.

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2016.

R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow et al., Nematus: A toolkit for neural machine translation, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL): Software Demonstrations. ACL, 2017.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, Computing Research Repository, 2014.

O. Press and L. Wolf, Using the output embedding to improve language models, Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). ACL, 2017.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 2014.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, Computing Research Repository, 2014.

R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, Proceedings of the International Conference on Machine Learning (ICML), 2013.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 2002.

C.-Y. Lin and F. J. Och, Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 2004.

Training Details. Unless otherwise specified, we use ADAM [52] as the optimizer with an initial learning rate of 0.0004. Gradients are clipped to unit norm [53]. Training is stopped early if task performance on the validation set does not improve for 10 consecutive epochs. Task performance is assessed using Word Error Rate (WER) for speech recognition.
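The early-stopping rule above can be sketched in plain Python. This is an illustrative helper, not the authors' implementation; the class name and interface are assumptions. Since WER is the validation metric, lower is better:

```python
# Hypothetical early-stopping helper: stop when the validation metric
# (e.g. WER, where lower is better) has not improved for `patience`
# consecutive epochs, matching the 10-epoch patience described above.

class EarlyStopping:
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")   # best (lowest) validation metric so far
        self.bad_epochs = 0        # epochs since the last improvement

    def step(self, val_metric):
        """Record one epoch's validation metric; return True to stop training."""
        if val_metric < self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch with the validation WER, and the loop would break as soon as it returns `True`.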

All systems are trained three times with different random initializations. Hypotheses are decoded using beam search with a beam size of 10. We report the average results of the three runs.
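The beam-search decoding step can be sketched as follows. This is a minimal illustrative version, not the authors' decoder; `step_scores` is a hypothetical stand-in for the trained model's next-token log-probabilities:

```python
# Minimal beam-search sketch: keep the `beam_size` highest-scoring prefixes
# at each step, moving hypotheses that emit `eos` to the finished pool.
import math

def beam_search(step_scores, eos, beam_size=10, max_len=20):
    """step_scores(prefix) -> {token: log_prob} for the next token."""
    beams = [([], 0.0)]            # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_scores(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:              # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])
```

A real system would also length-normalize the scores; this sketch returns the raw highest cumulative log-probability, which is the simplest variant.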