# Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

1 EXPRESSION - Expressiveness in Human Centered Data/Media
UBS - Université de Bretagne Sud, IRISA-D6 - MEDIA ET INTERACTIONS
Abstract: This paper addresses the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach relies on a quantitative comparability measure and a co-clustering method that combine the similarity measures available in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a sound and robust alignment of the co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters to provide "thematic" comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
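The thresholding step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `align_medoids`, the score matrix, the "best match per source medoid" rule, and the threshold value are all assumptions made for illustration. It only shows how a comparability threshold could filter candidate medoid pairs before manual verification.

```python
def align_medoids(comparability, threshold):
    """Pair each source-language medoid with its most comparable target medoid.

    comparability[i][j] is a hypothetical comparability score between
    source-language medoid i and target-language medoid j. Pairs whose best
    score falls below `threshold` are dropped (assumed rule, for illustration).
    """
    pairs = []
    for i, row in enumerate(comparability):
        # Index of the target medoid with the highest score for source medoid i.
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy scores for three medoids per language (made-up values).
scores = [
    [0.82, 0.10, 0.05],
    [0.20, 0.15, 0.64],
    [0.12, 0.30, 0.25],
]
aligned = align_medoids(scores, threshold=0.5)
```

In this toy case the third source medoid has no target scoring at least 0.5, so it stays unaligned; the retained pairs would then be the candidates handed to manual verification.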
Document type:
Conference paper
The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland. 2014

https://hal.archives-ouvertes.fr/hal-00995297
Contributor: Pierre-François Marteau
Submitted on: Friday, May 23, 2014 - 09:40:25
Last modified on: Wednesday, May 16, 2018 - 11:24:07

### Identifiers

• HAL Id : hal-00995297, version 1

### Citation

Guiyao Ke, Pierre-François Marteau. Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora. The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland. 2014. 〈hal-00995297〉
