EXPRESSION - Expressiveness in Human Centered Data/Media (Université de Bretagne-Sud - Campus de Tohannic - bâtiment ENSIBS - Rue Yves Mainguy BP 573 - 56017 Vannes cedex
École Nationale Supérieure des Sciences Appliquées et de Technologie - 6, rue de Kerampont - CS 80518 - 22305 Lannion cedex - France)
CentraleSupélec (3, rue Joliot Curie,
Plateau de Moulon,
91192 GIF-SUR-YVETTE Cedex - France)
Abstract: In this paper we address the assisted construction of bilingual thematic comparable corpora by co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach relies on a quantitative comparability measure and a co-clustering method that together make it possible to combine the similarity measures existing in each of the two linguistic spaces with a "thematic" comparability measure that defines a mapping between these two spaces. Building on the resulting improvement in co-clustering ($k$-medoids) performance, we use a comparability threshold and manual verification to ensure a good and robust alignment of co-clusters (co-medoids). Finally, from any available raw corpus, we enrich the aligned clusters in order to provide "thematic" comparable corpora of good quality and controlled size. In a case study that exploits raw web data, we show that this approach scales reasonably well and is well suited to the construction of thematic comparable corpora of good quality.
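The threshold-based alignment of co-medoids mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `align_medoids`, the mutual-best-match rule, the score matrix and the threshold value are all assumptions made for illustration, and the manual verification step is not modeled.

```python
# Hypothetical sketch: align medoids of two monolingual clusterings using a
# precomputed comparability matrix and a threshold. The mutual-best-match
# rule is an illustrative assumption, not the method described in the paper.

def align_medoids(comparability, threshold):
    """Return (i, j, score) for mutually best medoid pairs with score >= threshold.

    comparability[i][j] is the comparability score between source-language
    medoid i and target-language medoid j.
    """
    pairs = []
    if not comparability:
        return pairs
    n_rows = len(comparability)
    n_cols = len(comparability[0])
    for i, row in enumerate(comparability):
        # Best target medoid for source medoid i.
        j = max(range(n_cols), key=lambda c: row[c])
        # Best source medoid for that target medoid.
        best_i = max(range(n_rows), key=lambda r: comparability[r][j])
        # Keep the pair only if the match is mutual and comparable enough.
        if best_i == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Made-up scores for three source medoids and three target medoids.
scores = [
    [0.82, 0.10, 0.05],
    [0.15, 0.40, 0.64],
    [0.20, 0.31, 0.12],
]
aligned = align_medoids(scores, threshold=0.5)
print(aligned)
```

In this toy input the third source medoid stays unaligned because its best target is better matched by another source medoid. Pairs retained by such a filter would still go through the manual verification the abstract describes.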
https://hal.archives-ouvertes.fr/hal-00995297
Contributor: Pierre-François Marteau
Submitted on: Friday, May 23, 2014 - 09:40:25
Last modified on: Thursday, November 15, 2018 - 11:58:49
Guiyao Ke, Pierre-François Marteau. Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora. The 9th edition of the Language Resources and Evaluation Conference, LREC 2014, May 2014, Reykjavik, Iceland. 2014. 〈hal-00995297〉