Compact Multiview Representation of Documents Based on the Total Variability Space

Abstract : —Mapping text documents in an LDA-based topic-space is a classical way to extract high-level representation of text documents. Unfortunately, LDA is highly sensitive to hyper-parameters related to the number of classes, or word and topic distribution, and there is no systematic way to pre-estimate optimal configurations. Moreover, various hyper-parameter configurations offer complementary views on the document. In this paper, we propose a method based on a two-step process that, first, expands the representation space by using a set of topic spaces and, second, compacts the representation space by removing poorly relevant dimensions. These two steps are based respectively on multi-view LDA-based representation spaces and factor-analysis models. This model provides a view-independent representation of documents while extracting complementary information from a massive multi-view representation. Experiments are conducted on the DECODA conversation corpus and the Reuters-21578 textual dataset. Results show the efficiency of the proposed multiview compact representation paradigm. The proposed categorization system reaches an accuracy of 86.5% with automatic transcriptions of conversations from DECODA corpus and a Macro-F1 of 80% during a classification task of the well-known Reuters-21578 corpus, with a significant gain compared to the baseline (best single topic space configuration), as well as methods and document representations previously studied.
Document type :
Journal articles
Complete list of metadatas

https://hal.archives-ouvertes.fr/hal-01319808
Contributor : Bibliothèque Universitaire Déposants Hal-Avignon <>
Submitted on : Monday, May 23, 2016 - 9:40:09 AM
Last modification on : Tuesday, July 2, 2019 - 5:38:02 PM

Identifiers

Collections

Citation

Mohamed Morchid, Mohamed Bouallegue, Richard Dufour, Georges Linares, Driss Matrouf, et al.. Compact Multiview Representation of Documents Based on the Total Variability Space. IEEE/ACM Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2015, ⟨10.1109/TASLP.2015.2431854⟩. ⟨hal-01319808⟩

Share

Metrics

Record views

89