Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

Guiyao Ke; Pierre-François Marteau

Conference paper, Year: 2014

Guiyao Ke

Role: Author

Expressiveness in Human Centered Data/Media

Pierre-François Marteau

Role: Author
PersonId : 219
IdHAL : pierre-francois-marteau
ORCID : 0000-0002-3963-8795
IdRef : 033981124

Expressiveness in Human Centered Data/Media

Abstract

We address in this paper the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering approach that together make it possible to combine the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide "thematic" comparable corpora of good quality and controlled size. On a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
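The alignment step mentioned in the abstract (pairing medoids across the two languages when their comparability exceeds a threshold) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `comparability` function is a hypothetical dictionary-overlap score, `align_medoids` uses a simple greedy one-to-one matching, and the toy dictionary and documents are invented.

```python
from itertools import product


def comparability(doc_a, doc_b, dictionary):
    """Hypothetical score: share of doc_a's words whose dictionary
    translation occurs in doc_b (an assumption, not the paper's measure)."""
    words_a = set(doc_a.split())
    words_b = set(doc_b.split())
    if not words_a:
        return 0.0
    hits = sum(1 for w in words_a if dictionary.get(w) in words_b)
    return hits / len(words_a)


def align_medoids(medoids_a, medoids_b, dictionary, threshold):
    """Greedily pair medoids across languages, best score first,
    keeping only pairs whose score reaches the threshold."""
    scored = sorted(
        (
            (comparability(a, b, dictionary), i, j)
            for (i, a), (j, b) in product(
                enumerate(medoids_a), enumerate(medoids_b)
            )
        ),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_a or j in used_b:
            continue  # each medoid is aligned at most once
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j, score))
    return pairs


# Toy data, invented for this example.
toy_dictionary = {"chat": "cat", "chien": "dog", "banque": "bank", "argent": "money"}
medoids_fr = ["chat chien", "banque argent"]
medoids_en = ["money bank loan", "dog cat pet"]
aligned = align_medoids(medoids_fr, medoids_en, toy_dictionary, threshold=0.5)
```

In this sketch, pairs that pass the threshold would still be candidates for the manual verification the abstract describes; the greedy matching is only one possible design choice.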

Keywords

Thematic comparable corpora; Comparability measure; Co-clustering; Cluster alignment

Domains

Information retrieval [cs.IR]

https://hal.science/hal-00995297

Submitted on: Friday, May 23, 2014, 09:40:25

Last modified on: Friday, March 24, 2023, 14:52:58

Dates and versions

hal-00995297 , version 1 (23-05-2014)

Identifiers

HAL Id : hal-00995297 , version 1

Cite

Guiyao Ke, Pierre-François Marteau. Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora. The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland. ⟨hal-00995297⟩

Collections

INSTITUT-TELECOM EC-PARIS UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA IRISA-D6 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM
