Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation

Natalia Grabar; Thierry Hamon

Conference paper, Year: 2017

Natalia Grabar

Role: Author
PersonId : 6735
IdHAL : natalia-grabar
ORCID : 0000-0002-0237-4554
IdRef : 089015460

Savoirs, Textes, Langage (STL) - UMR 8163

Thierry Hamon

Role: Author
PersonId : 11519
IdHAL : thierry-hamon
ORCID : 0000-0002-1521-4875
IdRef : 069054711

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Université Paris 13

Abstract

The creation of linguistic resources (such as corpora, lexica or terminologies) occupies an important place in research areas related to linguistics, Natural Language Processing, computer science, psycholinguistics, etc. In this paper, we describe a multilingual corpus in which Ukrainian is the target language and the source languages are Polish, French and English. The corpus contains literary texts and a small subset of texts from the medical domain. In total, it comprises 62 literary texts and 129 medical texts. The corpus contains over 1 million words in the target language, Ukrainian, and at least as many in the source languages taken together. It is a directional corpus aligned at the sentence level. After describing the corpus, we present some possible uses and first results. We then conclude and indicate some directions for future work. The corpus presented in this work is available for research purposes: http://natalia.grabar.free.fr/resources.php

Keywords

Parallel corpora; Ukrainian; Natural Language Processing

Domains

Computer Science [cs]; Computation and Language [cs.CL]

https://hal.science/hal-01736363

Submitted on: Friday, 16 March 2018, 19:23:33

Last modified on: Wednesday, 28 February 2024, 14:37:09

Dates and versions

hal-01736363, version 1 (16-03-2018)

Identifiers

HAL Id: hal-01736363, version 1

Cite

Natalia Grabar, Thierry Hamon. Creation of a multilingual aligned corpus with Ukrainian as the target language and its exploitation. Computational Linguistics and Intelligent Systems, Apr 2017, Kharkiv, Ukraine. ⟨hal-01736363⟩

Collections

UNIV-PARIS13 CNRS LIMSI STL USPC UNIV-PARIS-SACLAY UNIV-LILLE SORBONNE-UNIVERSITE SORBONNE-PARIS-NORD LISN GS-ENGINEERING GS-COMPUTER-SCIENCE GS-SPORT-HUMAN-MOVEMENT ACT-R
