Automatic Compilation of Comparable corpora

Manuela Yapomo

Conference paper, Year: 2012

Affiliations

1. Linguistique, Langues et Parole
2. Laboratoire des sciences de l'ingénieur, de l'informatique et de l'imagerie

Abstract

Comparable corpora have proven to be a valuable alternative to scarce parallel corpora in various Natural Language Processing tasks. Many researchers have therefore stressed both the need for large quantities of such corpora and the scarcity of work on their compilation. This paper addresses the issue by using a cross-language information retrieval (CLIR) based method for the automatic acquisition of French-English comparable documents. First, source documents are translated and their most representative terms are extracted. The resulting keyword list is then expanded with synonyms, on the assumption that keyword expansion might improve the retrieval of such documents. Retrieval is performed on the indexed target collection, followed by a filtering step based mainly on temporal information and document length. Results are fair and suggest that using an ontology may improve the performance of the system.
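The pipeline described in the abstract (translate, extract keywords, expand with synonyms, retrieve, filter) can be illustrated with a minimal Python sketch. This is not the author's implementation: the toy dictionary `FR_EN`, the synonym table `SYNONYMS`, the frequency-based keyword selection, the overlap-based ranking, and the thresholds `max_days` and `max_len_ratio` are all hypothetical choices made only to keep the example self-contained.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

# Hypothetical toy resources; a real system would presumably rely on a
# machine translation component and a proper synonym lexicon.
FR_EN = {"président": "president", "élection": "election", "vote": "vote"}
SYNONYMS = {"president": ["leader"], "election": ["poll"]}


@dataclass
class Doc:
    text: str
    published: date


def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation removed."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in tokens if t]


def translate(tokens):
    """Word-by-word dictionary lookup; words missing from FR_EN are dropped."""
    return [FR_EN[t] for t in tokens if t in FR_EN]


def top_keywords(tokens, k=5):
    """Pick the k most frequent terms (an assumed notion of 'representative')."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def expand(keywords):
    """Append the synonyms of each keyword to the query."""
    expanded = list(keywords)
    for word in keywords:
        expanded.extend(SYNONYMS.get(word, []))
    return expanded


def retrieve(query, collection):
    """Rank target documents by the number of distinct query terms they contain."""
    terms = set(query)
    scored = []
    for doc in collection:
        overlap = len(terms & set(tokenize(doc.text)))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def keep(source, candidate, max_days=7, max_len_ratio=2.0):
    """Filter on publication-date distance and document-length ratio."""
    days = abs((source.published - candidate.published).days)
    len_s = len(tokenize(source.text))
    len_c = len(tokenize(candidate.text))
    ratio = max(len_s, len_c) / max(1, min(len_s, len_c))
    return days <= max_days and ratio <= max_len_ratio


def comparable_docs(source, collection):
    """Run the full sketch: translate, extract, expand, retrieve, filter."""
    query = expand(top_keywords(translate(tokenize(source.text))))
    return [doc for doc in retrieve(query, collection) if keep(source, doc)]
```

In this sketch, synonym expansion lets a French source about a "président" and an "élection" match an English document that only mentions a "leader" and a "poll", while the final filter discards candidates published too far apart in time or differing too much in length.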

Keywords

Cross-language information retrieval; (non-)linguistic criteria; similarity measurement

Domains

Computation and Language [cs.CL]

Main file

Article_ManuelaYapomo_draftversion.pdf (205.93 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-01073850

Submitted on: Friday, October 10, 2014, 15:27:42

Last modified on: Thursday, April 11, 2024, 13:08:14

Long-term archiving on: Sunday, January 11, 2015, 11:02:13

Dates and versions

hal-01073850 , version 1 (10-10-2014)

Identifiers

HAL Id: hal-01073850, version 1

Cite

Manuela Yapomo. Automatic Compilation of Comparable corpora. Natural Language Processing and Human Language Technology, Jun 2011, Faro, Portugal. ⟨hal-01073850⟩


Collections

CNRS ENGEES INSA-STRASBOURG INC-CNRS SITE-ALSACE INSA-GROUPE
