Compact Multiview Representation of Documents Based on the Total Variability Space

Mohamed Morchid; Mohamed Bouallegue; Richard Dufour; Georges Linares; Driss Matrouf; Renato de Mori

doi:10.1109/TASLP.2015.2431854

Journal article, IEEE/ACM Transactions on Audio, Speech and Language Processing, Year: 2015


Mohamed Morchid

Role: Author
PersonId : 21451
IdHAL : morchid
ORCID : 0000-0002-4427-2468
IdRef : 188328343

Laboratoire Informatique d'Avignon

Mohamed Bouallegue

Role: Author
PersonId : 772200
IdRef : 177675128

Laboratoire Informatique d'Avignon

Richard Dufour

Role: Author
PersonId : 178348
IdHAL : richard-dufour
ORCID : 0000-0003-1203-9108

Laboratoire Informatique d'Avignon

Georges Linares

Role: Author
PersonId : 4977
IdHAL : georges-linares
IdRef : 079368794

Laboratoire Informatique d'Avignon

Driss Matrouf

Role: Author
PersonId : 176307
IdHAL : driss-matrouf
IdRef : 137773439

Laboratoire Informatique d'Avignon

Renato de Mori

Role: Author

McGill University = Université McGill [Montréal, Canada]

Laboratoire Informatique d'Avignon

Abstract

Mapping text documents into an LDA-based topic space is a classical way to extract a high-level representation of them. Unfortunately, LDA is highly sensitive to hyper-parameters related to the number of classes and to the word and topic distributions, and there is no systematic way to pre-estimate optimal configurations. Moreover, different hyper-parameter configurations offer complementary views of a document. In this paper, we propose a two-step method that first expands the representation space by using a set of topic spaces and then compacts it by removing poorly relevant dimensions. These two steps rely, respectively, on multi-view LDA-based representation spaces and on factor-analysis models. The resulting model provides a view-independent representation of documents while extracting complementary information from a massive multi-view representation. Experiments are conducted on the DECODA conversation corpus and the Reuters-21578 textual dataset. Results show the efficiency of the proposed multiview compact representation paradigm. The proposed categorization system reaches an accuracy of 86.5% on automatic transcriptions of conversations from the DECODA corpus and a Macro-F1 of 80% on a classification task over the well-known Reuters-21578 corpus, a significant gain over the baseline (the best single topic-space configuration) as well as over previously studied methods and document representations.
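The abstract reports two different metrics, accuracy on DECODA and Macro-F1 on Reuters-21578. As a reading aid, the sketch below shows how the two are commonly computed. It is not the authors' evaluation code: the function names `accuracy` and `macro_f1` and the toy label lists are assumptions made for illustration, and the paper's exact averaging protocol may differ.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the reference class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the reference class and was missed
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); the guard covers an empty denominator.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data or results from the paper.
y_true = ["earn", "earn", "earn", "earn", "acq", "acq"]
y_pred = ["earn", "earn", "earn", "acq", "acq", "acq"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because Macro-F1 weights every class equally, small classes influence it as much as large ones, which is why it can diverge from accuracy on a corpus with an unbalanced class distribution.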

Keywords

C-vector, classification, factor analysis, latent Dirichlet allocation

Domains

Computer Science [cs]


https://hal.science/hal-01319808

Submitted on: Monday, 23 May 2016, 09:40:09

Last modified on: Friday, 12 November 2021, 11:18:05

Dates and versions

hal-01319808 , version 1 (23-05-2016)

Identifiers

HAL Id : hal-01319808 , version 1
DOI : 10.1109/TASLP.2015.2431854

Cite

Mohamed Morchid, Mohamed Bouallegue, Richard Dufour, Georges Linares, Driss Matrouf, et al. Compact Multiview Representation of Documents Based on the Total Variability Space. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015, ⟨10.1109/TASLP.2015.2431854⟩. ⟨hal-01319808⟩


Collections

UNIV-AVIGNON LIA
