Text categorization based on co-classification learning from multilingual corpora

Massih-Reza Amini; Goutte Cyril

Patent. Year: 2013

Massih-Reza Amini

Role: Corresponding author
PersonId: 747054
IdHAL: massih-reza-amini
ORCID: 0000-0001-9032-4233
IdRef: 132277042

Analyse de données, Modélisation et Apprentissage automatique [Grenoble]

Goutte Cyril

Role: Author
PersonId: 947973

Abstract

The patent describes a method and a system for generating classifiers from multilingual corpora that include subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention use this similarity to enhance the accuracy of the classification in one language based on the classification results in the other language, and vice versa. A system in accordance with the present embodiments implements a method which comprises generating a first classifier from a first subset of the corpora in a first language; generating a second classifier from a second subset of the corpora in a second language; and re-training each of the classifiers on its respective subset based on the classification results of the other classifier, until a training cost between the classification results produced by successive iterations reaches a local minimum.
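The alternating re-training described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract does not specify the classifier or the training cost, so the sketch assumes two logistic-regression classifiers, a hypothetical weight `lam` that pulls each classifier toward the other view's predictions, and a monitored cost made of the labelled log-loss plus a squared disagreement term. The toy feature vectors (a bias term followed by two word counts per language) and all function names are illustrative assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # Probability that document x belongs to the positive class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def fit(X, y, peer=None, lam=0.5, w=None, lr=0.5, epochs=200):
    # Logistic regression by batch gradient descent. When `peer` holds the
    # other view's probabilities, each prediction is also pulled toward them
    # (assumed coupling, weighted by lam).
    w = list(w) if w is not None else [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for i, (x, t) in enumerate(zip(X, y)):
            p = predict(w, x)
            err = p - t
            if peer is not None:
                err += lam * (p - peer[i])
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def cost(X1, w1, X2, w2, y, lam):
    # Monitored cost (an assumption): log-loss of both classifiers on the
    # labels plus lam times their squared disagreement, averaged per document.
    eps = 1e-12
    total = 0.0
    for x1, x2, t in zip(X1, X2, y):
        p1, p2 = predict(w1, x1), predict(w2, x2)
        for p in (p1, p2):
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
        total += lam * (p1 - p2) ** 2
    return total / len(y)

def co_classify(X1, X2, y, lam=0.5, tol=1e-6, max_rounds=50):
    # Step 1: one classifier per language, trained independently.
    w1, w2 = fit(X1, y), fit(X2, y)
    prev = cost(X1, w1, X2, w2, y, lam)
    # Step 2: alternately re-train each classifier on its own view, using
    # the other classifier's outputs, until the cost stops changing.
    for _ in range(max_rounds):
        p2 = [predict(w2, x) for x in X2]
        w1 = fit(X1, y, peer=p2, lam=lam, w=w1)
        p1 = [predict(w1, x) for x in X1]
        w2 = fit(X2, y, peer=p1, lam=lam, w=w2)
        cur = cost(X1, w1, X2, w2, y, lam)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return w1, w2

# Toy aligned corpus: row i of X_en and X_fr describe the same document.
# Columns: bias, count of a "sports" word, count of a "politics" word.
X_en = [[1, 2, 0], [1, 1, 0], [1, 0, 2], [1, 0, 1]]
X_fr = [[1, 1, 0], [1, 2, 0], [1, 0, 1], [1, 0, 3]]
y = [1, 1, 0, 0]  # 1 = sports, 0 = politics

w_en, w_fr = co_classify(X_en, X_fr, y)
```

After the call, `predict(w_en, x)` and `predict(w_fr, x)` give each language's class probability for a document. The stopping test compares the monitored cost across successive rounds, which mirrors the abstract's stopping condition only loosely.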

Keywords

Multilingual; Text categorization; Machine learning

Domains

Machine Learning [cs.LG]

Contributor: Massih-Reza Amini

https://hal.science/hal-00871809

Submitted on: Thursday, October 10, 2013, 15:45:32

Last modified on: Thursday, April 4, 2024, 18:26:00

Dates and versions

hal-00871809, version 1 (10-10-2013)

Identifiers

HAL Id: hal-00871809, version 1

Cite

Massih-Reza Amini, Goutte Cyril. Text categorization based on co-classification learning from multilingual corpora. United States, Patent No. 20060101. 2013. ⟨hal-00871809⟩

Collections

UGA CNRS LIG LIG_SIDCH LIG_SIDCH_APTIKAL
