Unsupervised cross-lingual representation modeling for variable length phrases

Jingshu Liu 1
1 TALN - Traitement Automatique du Langage Naturel
LS2N - Laboratoire des Sciences du Numérique de Nantes
Abstract : Significant advances have been achieved in bilingual word-level alignment from comparable corpora, yet phrase-level alignment remains a challenge. Traditional phrase alignment methods can only handle phrases of equal length, while word-embedding-based approaches, which learn phrase embeddings as individual vocabulary entries, suffer from data sparsity and cannot handle out-of-vocabulary phrases. Since bilingual alignment is a vector comparison task, phrase representation plays a key role. In this thesis, we study approaches to unified phrase modeling and cross-lingual phrase alignment, ranging from co-occurrence models to the most recent state-of-the-art neural approaches. We review supervised and unsupervised frameworks for modeling cross-lingual phrase representations. This work makes two contributions. First, we propose a new architecture, the tree-free recursive neural network (TF-RNN), for modeling phrases of variable length; combined with a wrapped context prediction training objective, it outperforms state-of-the-art approaches on a monolingual phrase synonymy task using only plain-text training data. Second, for cross-lingual modeling, we propose incorporating an architecture derived from TF-RNN into an encoder-decoder model with a pseudo back-translation mechanism inspired by unsupervised neural machine translation. Our proposal significantly improves the bilingual alignment of phrases of different lengths.

Cited literature [140 references]
Contributor : Emmanuel Morin
Submitted on : Wednesday, September 23, 2020 - 6:41:21 PM
Last modification on : Wednesday, April 27, 2022 - 3:56:00 AM
Long-term archiving on : Friday, December 4, 2020 - 6:44:11 PM


thesis_of_Jingshu_Liu (2).pdf
Files produced by the author(s)


  • HAL Id : tel-02938554, version 1


Jingshu Liu. Unsupervised cross-lingual representation modeling for variable length phrases. Computation and Language [cs.CL]. Université de Nantes, 2020. English. ⟨tel-02938554⟩


