Semi-Supervised Document Classification with a Mislabeling Error Model

Anastasia Krithara; Massih-Reza Amini; Cyril Goutte

doi:10.1007/978-3-540-78646-7_34

Conference paper, Year: 2008

Anastasia Krithara

Role: Author
PersonId : 980877

Machine Learning and Information Retrieval

Massih-Reza Amini

Role: Author
PersonId : 747054
IdHAL : massih-reza-amini
ORCID : 0000-0001-9032-4233
IdRef : 132277042

Machine Learning and Information Retrieval

Cyril Goutte

Role: Author

Abstract

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its own labeling errors. These probabilities are then taken into account when estimating the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA introduced by [9], which is based on the use of fake labels, while retaining that method's simplicity and its ability to solve multiclass problems. In addition, it provides valuable information about which classes are the most uncertain and the most difficult to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach compared with two other semi-supervised algorithms applied to these text classification problems.
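To make the idea of a mislabeling error model more concrete, the following is a minimal sketch and not the paper's PLSA update equations. It assumes that a model already supplies class posteriors for the unlabeled documents, and it estimates a matrix beta[k, y] approximating P(assigned label = k | true class = y) from soft counts. The function name, its arguments, and the soft-count estimator are all assumptions made for illustration.

```python
import numpy as np


def estimate_mislabeling_matrix(posteriors, assigned_labels, n_classes):
    """Hypothetical helper: estimate beta[k, y] ~ P(assigned label = k | true class = y).

    posteriors: (n_docs, n_classes) array; row i is the model's current P(y | doc i).
    assigned_labels: (n_docs,) integer array of labels given to the unlabeled docs.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    assigned_labels = np.asarray(assigned_labels)
    beta = np.zeros((n_classes, n_classes))
    for k in range(n_classes):
        # Soft count of documents of each true class y that received label k.
        beta[k] = posteriors[assigned_labels == k].sum(axis=0)
    # Normalise each column so that sum_k beta[k, y] = 1.
    col_sums = beta.sum(axis=0, keepdims=True)
    # Assumption: a class with no posterior mass falls back to a uniform column.
    uniform = np.full_like(beta, 1.0 / n_classes)
    return np.divide(beta, col_sums, out=uniform, where=col_sums > 0)


# Toy example: four unlabeled documents, two classes, labels taken as the argmax.
posteriors = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
assigned = posteriors.argmax(axis=1)
beta = estimate_mislabeling_matrix(posteriors, assigned, n_classes=2)
# Per-class estimate of how often the assigned label disagrees with the true class.
error_rate_per_class = 1.0 - np.diag(beta)
```

Under these assumptions, classes with a large value in error_rate_per_class would be the ones the labeling step finds hardest, which is the kind of per-class information the abstract refers to.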

Domains

Computer Science [cs]

https://hal.science/hal-01301551

Submitted on: Tuesday, 12 April 2016, 14:18:04

Last modified on: Thursday, 14 March 2024, 14:40:45

Dates and versions

hal-01301551 , version 1 (12-04-2016)

Identifiers

HAL Id : hal-01301551 , version 1
DOI : 10.1007/978-3-540-78646-7_34

Cite

Anastasia Krithara, Massih-Reza Amini, Cyril Goutte. Semi-Supervised Document Classification with a Mislabeling Error Model. European Conference on Information Retrieval (ECIR'08), Mar 2008, Glasgow, United Kingdom. pp.370-381, ⟨10.1007/978-3-540-78646-7_34⟩. ⟨hal-01301551⟩

Collections

UPMC CNRS LIP6 SORBONNE-UNIVERSITE SU-SCIENCES
