Short Text Classification Using Semantic Random Forest

Applying traditional Random Forests to short text classification yields lower performance than applying them to standard texts. This degradation stems from the shortness and sparseness of short texts and from their lack of contextual information. Existing solutions to these issues are mainly based on data enrichment, which can also introduce noise. We propose a new approach that combines data enrichment with the introduction of semantics into Random Forests. Each short text is enriched with data semantically similar to its words. These data come from an external source of knowledge organized into topics by the Latent Dirichlet Allocation model. The learning process in Random Forests is adapted to take semantic relations between words into account while building the trees. Tests performed on search-snippets with the new method showed significant improvements in classification: accuracy increased by 34% compared to traditional Random Forests and by 20% compared to MaxEnt.
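The enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how topics are matched to a short text, so the overlap-count rule below is an assumption, and the `TOPICS` word lists and the `enrich` function are hypothetical stand-ins for topics that would come from a Latent Dirichlet Allocation model fitted on an external corpus.

```python
from collections import Counter

# Hypothetical topics. In the described approach these would come from an LDA
# model fitted on an external knowledge source; here they are hand-written.
TOPICS = {
    0: ["football", "match", "team", "goal", "league"],
    1: ["stock", "market", "shares", "investor", "price"],
    2: ["virus", "vaccine", "health", "disease", "doctor"],
}


def enrich(text, topics=TOPICS, top_k=1):
    """Append the words of the top_k topics sharing the most words with text.

    Assumption: topic relevance is measured by simple word overlap.
    """
    tokens = text.lower().split()

    # Count how many tokens of the short text appear in each topic.
    overlap = Counter()
    for topic_id, words in topics.items():
        vocabulary = set(words)
        overlap[topic_id] = sum(1 for token in tokens if token in vocabulary)

    enriched = list(tokens)
    for topic_id, count in overlap.most_common(top_k):
        if count == 0:
            continue  # no shared word: adding this topic would only add noise
        enriched.extend(w for w in topics[topic_id] if w not in enriched)
    return enriched


print(enrich("Team wins league match"))
# ['team', 'wins', 'league', 'match', 'football', 'goal']
```

Skipping topics with no shared word reflects the concern raised in the abstract that enrichment can introduce noise. The adapted tree-building step is not sketched here because the abstract gives no detail on how semantic relations are used during learning.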

Keywords

Short text classification; Random Forest; Latent Dirichlet Allocation; Semantics

Domains

Computer Science [cs]; Machine Learning [cs.LG]; Document and Text Processing

Contributor: Frédéric Precioso

https://hal.science/hal-01325212

Submitted on: Thursday, June 2, 2016, 00:54:53

Last modified on: Monday, February 26, 2024, 11:22:11

Dates and versions

hal-01325212, version 1 (June 2, 2016)

Identifiers

HAL Id: hal-01325212, version 1
DOI: 10.1007/978-3-319-10160-6_26

Cite

Ameni Bouaziz, Christel Dartigues-Pallez, Célia da Costa Pereira, Frédéric Precioso, Patrick Lloret. Short Text Classification Using Semantic Random Forest. Data Warehousing and Knowledge Discovery, Sep 2014, Munich, Germany. ⟨10.1007/978-3-319-10160-6_26⟩. ⟨hal-01325212⟩

Collections

CNRS I3S UNIV-COTEDAZUR