A Co-classification Approach to Learning from Multilingual Corpora

Massih-Reza Amini (1), Cyril Goutte
(1) MALIRE - Machine Learning and Information Retrieval, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract: We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview, co-regularization learning approach in which we treat each language as a separate source and minimize a joint loss that combines the monolingual classification losses in each language while enforcing consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and, most of the time, also outperform models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
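The joint loss described in the abstract can be illustrated with a minimal sketch for two language views and binary labels. The abstract does not state the exact form of the consistency term, so the squared gap between the two views' predicted probabilities used here is an assumption for illustration only, not the authors' formulation. The names `log_loss`, `disagreement`, `joint_loss` and the weight `lam` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    """Mean logistic loss of a linear classifier w on one language view (y in {0, 1})."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def disagreement(w_a, X_a, w_b, X_b):
    """Assumed consistency penalty: mean squared gap between the two views' predictions."""
    return float(np.mean((sigmoid(X_a @ w_a) - sigmoid(X_b @ w_b)) ** 2))

def joint_loss(w_a, X_a, w_b, X_b, y, lam=1.0):
    """Sum of the two monolingual losses plus lam times the cross-view disagreement."""
    return (log_loss(w_a, X_a, y) + log_loss(w_b, X_b, y)
            + lam * disagreement(w_a, X_a, w_b, X_b))

# Toy data: the same 5 documents seen through two views with different feature spaces.
rng = np.random.default_rng(0)
X_a, X_b = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
y = np.array([0, 1, 1, 0, 1])
w_a, w_b = rng.normal(size=3), rng.normal(size=4)
print(joint_loss(w_a, X_a, w_b, X_b, y, lam=1.0))
```

With `lam=0` the objective reduces to two independently trained monolingual classifiers. Larger values of `lam` penalize the two views for disagreeing on the same document.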
Document type: Journal article

https://hal.archives-ouvertes.fr/hal-01172633
Contributor: Lip6 Publications
Submitted on: Tuesday, July 7, 2015 - 3:56:19 PM
Last modification on: Thursday, March 21, 2019 - 2:34:27 PM

Citation

Massih-Reza Amini, Cyril Goutte. A Co-classification Approach to Learning from Multilingual Corpora. Machine Learning, Springer Verlag, 2010, 79 (1-2), pp. 105-121. ⟨10.1007/s10994-009-5151-5⟩. ⟨hal-01172633⟩
