Combining Coregularization and Consensus-Based Self-Training for Multilingual Text Categorization

Massih-Reza Amini; Cyril Goutte; Nicolas Usunier

doi:10.1145/1835449.1835529

Conference paper, Year: 2010

Massih-Reza Amini

Role: Author
PersonId: 747054
IdHAL: massih-reza-amini
ORCID: 0000-0001-9032-4233
IdRef: 132277042

Machine Learning and Information Retrieval

Cyril Goutte

Role: Author

Nicolas Usunier

Role: Author
PersonId: 933831

Machine Learning and Information Retrieval

Abstract

We investigate the problem of learning document classifiers in a multilingual setting, from collections where labels are only partially available. We address this problem in the framework of multiview learning, where different languages correspond to different views of the same document, combined with semi-supervised learning in order to benefit from unlabeled documents. We rely on two techniques, coregularization and consensus-based self-training, that combine multiview and semi-supervised learning in different ways. Our approach trains different monolingual classifiers on each of the views, such that the classifiers' decisions over a set of unlabeled examples are in agreement as much as possible, and iteratively labels new examples from another unlabeled training set based on a consensus across language-specific classifiers. We derive a boosting-based training algorithm for this task, and analyze the impact of the number of views on the semi-supervised learning results on a multilingual extension of the Reuters RCV1/RCV2 corpus using five different languages. Our experiments show that coregularization and consensus-based self-training are complementary and that their combination is especially effective in the interesting and very common situation where there are few views (languages) and few labeled documents available.
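The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the boosting-based algorithm of the paper: the function names `disagreement` and `consensus_pseudo_labels`, the squared-difference penalty, and the vote-share threshold `min_agreement` are all assumptions chosen for illustration. The first function measures how much the per-language classifiers disagree on unlabeled documents (the quantity a coregularization term would penalize); the second keeps only the unlabeled documents on which enough views vote for the same label (the consensus step of self-training).

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence


def disagreement(view_scores: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between the scores of every pair of views.

    Hypothetical stand-in for a coregularization penalty: it is zero when
    all views output identical scores on the unlabeled documents.
    """
    pairs = list(combinations(view_scores, 2))
    if not pairs or not view_scores[0]:
        return 0.0
    total = 0.0
    for a, b in pairs:
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


def consensus_pseudo_labels(
    view_predictions: Sequence[Sequence[int]],
    min_agreement: float = 1.0,
) -> Dict[int, int]:
    """Map document index -> pseudo-label where enough views agree.

    view_predictions[v][i] is the label predicted by view v for document i.
    min_agreement is the required share of views voting for the winning
    label (1.0 means a unanimous vote; this rule is an assumption).
    """
    if not 0.5 < min_agreement <= 1.0:
        raise ValueError("min_agreement must be in (0.5, 1.0]")
    if not view_predictions:
        return {}
    n_views = len(view_predictions)
    n_docs = len(view_predictions[0])
    if any(len(preds) != n_docs for preds in view_predictions):
        raise ValueError("all views must predict for the same documents")

    selected: Dict[int, int] = {}
    for i in range(n_docs):
        votes = Counter(preds[i] for preds in view_predictions)
        label, count = votes.most_common(1)[0]
        if count / n_views >= min_agreement:
            selected[i] = label
    return selected


if __name__ == "__main__":
    # Three views (languages), four unlabeled documents, labels in {-1, +1}.
    predictions = [[1, -1, 1, 1], [1, -1, -1, 1], [1, 1, -1, 1]]
    print(consensus_pseudo_labels(predictions))
```

In a full self-training loop, the documents returned by `consensus_pseudo_labels` would be moved to the labeled set and the per-view classifiers retrained, repeating until no new document reaches consensus.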

Domains

Computer Science [cs]

https://hal.science/hal-01291883

Submitted on: Tuesday, March 22, 2016, 11:38:07

Last modified on: Thursday, March 14, 2024, 14:40:45

Dates and versions

hal-01291883, version 1 (22-03-2016)

Identifiers

HAL Id: hal-01291883, version 1
DOI: 10.1145/1835449.1835529

Cite

Massih-Reza Amini, Cyril Goutte, Nicolas Usunier. Combining Coregularization and Consensus-Based Self-Training for Multilingual Text Categorization. The 33rd Annual ACM SIGIR Conference (SIGIR 2010), Jul 2010, Geneva, Switzerland. pp.475-482, ⟨10.1145/1835449.1835529⟩. ⟨hal-01291883⟩

Collections

UPMC CNRS LIP6 SORBONNE-UNIVERSITE SU-SCIENCES
