A. Abeillé, L. Clément, and F. Toussenel, Building a Treebank for French, pp.165-187, 2003.

W. Antoun, F. Baly, and H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, 2020.

M. Artetxe and H. Schwenk, Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond, Transactions of the Association for Computational Linguistics, vol.7, pp.597-610, 2019.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, Journal of Machine Learning Research, vol.3, pp.1137-1155, 2003.

J. Blitzer, R. McDonald, and F. Pereira, Domain adaptation with structural correspondence learning, Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp.120-128, 2006.

C. Buck, K. Heafield, and B. van Ooyen, N-gram counts and language models from the Common Crawl, Proceedings of the Language Resources and Evaluation Conference, 2014.

J. Cañete, G. Chaperon, R. Fuentes, and J. Pérez, Spanish pre-trained BERT model and evaluation data, 2020.

Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang et al., Enhanced LSTM for natural language inference, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1657-1668, 2017.

R. Collobert and J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, Proceedings of the 25th International Conference on Machine Learning, pp.160-167, 2008.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu et al., Natural language processing (almost) from scratch, Journal of Machine Learning Research, vol.12, pp.2493-2537, 2011.

A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman et al., XNLI: Evaluating cross-lingual sentence representations, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.2475-2485, 2018.

A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek et al., Unsupervised cross-lingual representation learning at scale, 2019.

M. Constant, M. Candito, and D. Seddah, The LIGM-Alpage architecture for the SPMRL 2013 shared task: Multiword expression analysis and dependency parsing, Proceedings of the EMNLP Workshop on Statistical Parsing of Morphologically Rich Languages, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00932372

A. M. Dai and Q. V. Le, Semi-supervised sequence learning, Advances in Neural Information Processing Systems, pp.3079-3087, 2015.

W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord et al., BERTje: A Dutch BERT model, 2019.

P. Delobelle, T. Winters, and B. Berendt, RobBERT: A Dutch RoBERTa-based language model, 2020.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.4171-4186, 2019.

T. Dozat and C. D. Manning, Deep biaffine attention for neural dependency parsing, ICLR, 2016.

A. Eisele and Y. Chen, MultiUN: A multilingual corpus from United Nation documents, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), 2010.

J. Eisenschlos, S. Ruder, P. Czapla, M. Kardas, S. Gugger et al., MultiFiT: Efficient multi-lingual language model fine-tuning, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.5702-5707, 2019.

A. Fan, E. Grave, and A. Joulin, Reducing transformer depth on demand with structured dropout, International Conference on Learning Representations, 2019.

L. Gong, D. He, Z. Li, T. Qin, L. Wang et al., Efficient training of BERT by progressively stacking, International Conference on Machine Learning, pp.2337-2346, 2019.

M. Hadj-Salah, Arabic word sense disambiguation for and by machine translation, PhD thesis, Faculté des Sciences économiques et de gestion, 2018.
URL : https://hal.archives-ouvertes.fr/tel-02139438

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

J. Howard and S. Ruder, Universal language model fine-tuning for text classification, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.328-339, 2018.

G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, Deep networks with stochastic depth, European Conference on Computer Vision, pp.646-661, 2016.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

N. Kitaev and D. Klein, Constituency parsing with a self-attentive encoder, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.2676-2686, 2018.

N. Kitaev, S. Cao, and D. Klein, Multilingual constituency parsing with self-attention and pre-training, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.3499-3505, 2019.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico et al., Moses: Open source toolkit for statistical machine translation, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp.177-180, 2007.

P. Koehn, Europarl: A parallel corpus for statistical machine translation, Machine Translation Summit, pp.79-86, 2005.

Y. Kuratov and M. Arkhipov, Adaptation of deep bidirectional multilingual transformers for Russian language, 2019.

G. Lample and A. Conneau, Cross-lingual language model pretraining, Advances in Neural Information Processing Systems, 2019.

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma et al., ALBERT: A lite BERT for self-supervised learning of language representations, 2019.

X. Li, P. Michel, A. Anastasopoulos, Y. Belinkov, N. Durrani et al., Findings of the first shared task on machine translation robustness, Proceedings of the Fourth Conference on Machine Translation (WMT), p.91, 2019.

P. Lison and J. Tiedemann, OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles, International Conference on Language Resources and Evaluation, 2016.

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi et al., RoBERTa: A robustly optimized BERT pretraining approach, 2019.

L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary et al., CamemBERT: a Tasty French Language Model, arXiv e-prints, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02445946

B. Mccann, J. Bradbury, C. Xiong, and R. Socher, Learned in translation: Contextualized word vectors, Advances in Neural Information Processing Systems, pp.6294-6305, 2017.

Wikimedia, Data dumps - Meta, discussion about Wikimedia projects, 2019.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, Proceedings of the 26th International Conference on Neural Information Processing Systems, vol.2, pp.3111-3119, 2013.

G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker, A semantic concordance, Proceedings of the workshop on Human Language Technology, HLT '93, pp.303-308, 1993.

G. A. Miller, WordNet: a lexical database for English, Communications of the ACM, vol.38, issue.11, pp.39-41, 1995.

R. Navigli and S. P. Ponzetto, BabelNet: Building a very large multilingual semantic network, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp.216-225, 2010.

R. Navigli, D. Jurgens, and D. Vannella, SemEval-2013 Task 12: Multilingual Word Sense Disambiguation, Proceedings of the Seventh International Workshop on Semantic Evaluation, vol.2, pp.222-231, 2013.

R. Navigli, Word sense disambiguation: A survey, ACM Computing Surveys, vol.41, issue.2, 2009.

D. Q. Nguyen and A. T. Nguyen, PhoBERT: Pre-trained language models for Vietnamese, 2020.

T. Q. Nguyen and J. Salazar, Transformers without tears: Improving the normalization of self-attention, 2019.

M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, fairseq: A fast, extensible toolkit for sequence modeling, Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

J. Pennington, R. Socher, and C. D. Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1532-1543, 2014.

M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark et al., Deep contextualized word representations, Proceedings of NAACL-HLT, pp.2227-2237, 2018.

N. Pham, T. Nguyen, J. Niehues, M. Müller, and A. Waibel, Very deep self-attention networks for end-to-end speech recognition, 2019.

M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, and V. Basile, AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets, Proceedings of the Sixth Italian Conference on Computational Linguistics, CEUR Workshop Proceedings, vol.2481, 2019.

P. Prettenhofer and B. Stein, Cross-language text classification using structural correspondence learning, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp.1118-1127, 2010.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pre-training, 2018.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang et al., Exploring the limits of transfer learning with a unified text-to-text transformer, 2019.

P. Rajpurkar, R. Jia, and P. Liang, Know what you don't know: Unanswerable questions for SQuAD, 2018.

P. Ramachandran, P. Liu, and Q. Le, Unsupervised pretraining for sequence to sequence learning, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.383-391, 2017.

D. Seddah, R. Tsarfaty, S. Kübler, M. Candito, J. D. Choi et al., Overview of the SPMRL 2013 shared task: A cross-framework evaluation of parsing morphologically rich languages, Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp.146-182, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00877096

V. Segonne, M. Candito, and B. Crabbé, Using Wiktionary as a resource for WSD: the case of French verbs, Proceedings of the 13th International Conference on Computational Semantics-Long Papers, pp.259-270, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02436417

R. Sennrich, B. Haddow, and A. Birch, Neural machine translation of rare words with subword units, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.1715-1725, 2016.

G. Sérasset, DBnary: Wiktionary as a LMF based multilingual RDF network, Language Resources and Evaluation Conference, 2012.

R. Skadiņš, J. Tiedemann, R. Rozis, and D. Deksne, Billions of parallel words for free: Building and using the EU Bookshop corpus, Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC, pp.1850-1855, 2014.

F. Souza, R. Nogueira, and R. Lotufo, Portuguese named entity recognition using BERT-CRF, 2019.

W. Styler, The EnronSent corpus, 2011.

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems, pp.3104-3112, 2014.

J. Tiedemann, Parallel data, tools and interfaces in OPUS, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in Neural Information Processing Systems, pp.5998-6008, 2017.

A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. Gomez et al., Tensor2Tensor for neural machine translation, Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, vol.1, pp.193-199, 2018.

L. Vial, B. Lecouteux, and D. Schwab, UFSAC: Unification of Sense Annotated Corpora and Tools, Language Resources and Evaluation Conference (LREC), 2018.

L. Vial, B. Lecouteux, and D. Schwab, Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense Disambiguation, Proceedings of the 10th Global Wordnet Conference, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02131872

A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti et al., Multilingual is not enough: BERT for Finnish, 2019.

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy et al., GLUE: A multi-task benchmark and analysis platform for natural language understanding, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.353-355, 2018.

A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael et al., SuperGLUE: A stickier benchmark for general-purpose language understanding systems, 2019.

Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li et al., Learning deep transformer models for machine translation, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.1810-1822, 2019.

A. Williams, N. Nangia, and S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.1112-1122, 2018.

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue et al., HuggingFace's Transformers: State-of-the-art natural language processing, ArXiv, 2019.

H. Xu, Q. Liu, J. van Genabith, and J. Zhang, Why deep transformers are difficult to converge? From computation order to Lipschitz restricted parameter initialization, 2019.

Y. Yang, Y. Zhang, C. Tar, and J. Baldridge, PAWS-X: A cross-lingual adversarial dataset for paraphrase identification, 2019.

Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov et al., XLNet: Generalized autoregressive pretraining for language understanding, Advances in Neural Information Processing Systems, 2019.

H. Zhang, Y. N. Dauphin, and T. Ma, Fixup initialization: Residual learning without normalization, 2019.

Y. Zhang, J. Baldridge, and L. He, PAWS: Paraphrase adversaries from word scrambling, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.1298-1308, 2019.