# Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

1 EXPRESSION - Expressiveness in Human Centered Data/Media
UBS - Université de Bretagne Sud, IRISA-D6 - MEDIA ET INTERACTIONS
Abstract : We address the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach relies on a quantitative comparability measure and a co-clustering method that together make it possible to combine the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of the co-clusters (co-medoids). Finally, we enrich the aligned clusters from any available raw corpus in order to provide "thematic" comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
Document type : Conference papers

https://hal.archives-ouvertes.fr/hal-00995297
Contributor : Pierre-François Marteau
Submitted on : Friday, May 23, 2014 - 9:40:25 AM
Last modification on : Thursday, November 15, 2018 - 11:58:49 AM

### Identifiers

• HAL Id : hal-00995297, version 1

### Citation

Guiyao Ke, Pierre-François Marteau. Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora. The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland. ⟨hal-00995297⟩
