XML Document Classification using SVM

Samaneh Chagheri; Catherine Roussey; Sylvie Calabretto; Cyril Dumoulin

Conference paper, Year: 2010

Samaneh Chagheri

Role: Author

Distribution, Recherche d'Information et Mobilité

Catherine Roussey

Role: Author
PersonId : 185012
IdHAL : catherine-roussey
ORCID : 0000-0002-3076-5499
IdRef : 091645638

Technologies et systèmes d'information pour les agrosystèmes

Sylvie Calabretto

Role: Author
PersonId : 7155
IdHAL : sylvie-calabretto
ORCID : 0000-0002-4597-4680
IdRef : 061333654

Distribution, Recherche d'Information et Mobilité

Cyril Dumoulin

Role: Author

None

Abstract

This paper describes a representation of XML documents for the purpose of classifying them. Document classification depends on document representation techniques: the more relevant the representation, the more relevant the classification will be. We propose a representation model that exploits both the structure and the content of the document. Our approach is based on the vector space model: a document is represented by a vector of weighted features, where each feature is a (tag, term) pair. We have extended tf*idf to compute each feature's weight according to the term's structural level in the document. An SVM is used as the learning algorithm. Experiments on the Reuters collection show that our proposal improves classification performance compared to the standard classification model based on term vectors.
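The representation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the version below is an assumption rather than the authors' method: each (tag, term) occurrence is scaled by 1/depth of its enclosing element, then multiplied by a standard idf. The function names `tag_term_counts` and `weighted_vectors` and the sample documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_term_counts(xml_text):
    """Count (tag, term) pairs, scaled by a hypothetical depth-based weight."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, depth):
        # Assumption (not from the paper): terms closer to the root
        # count more, so each occurrence contributes 1 / depth.
        level_weight = 1.0 / depth
        # Only the element's leading text is used; tail text is ignored.
        for term in (elem.text or "").lower().split():
            counts[(elem.tag, term)] += level_weight
        for child in elem:
            walk(child, depth + 1)

    walk(root, 1)
    return counts


def weighted_vectors(docs):
    """Turn XML strings into sparse {(tag, term): weight} vectors."""
    per_doc = [tag_term_counts(d) for d in docs]
    n_docs = len(per_doc)
    # Document frequency: in how many documents each feature appears.
    doc_freq = Counter(f for counts in per_doc for f in counts)
    return [
        {f: tf * math.log(n_docs / doc_freq[f]) for f, tf in counts.items()}
        for counts in per_doc
    ]


docs = [
    "<doc><title>oil prices</title><body>oil exports rise</body></doc>",
    "<doc><title>wheat harvest</title><body>oil demand falls</body></doc>",
]
vectors = weighted_vectors(docs)
```

Here ("title", "oil") and ("body", "oil") are distinct features, so the same term can carry different weights depending on where it occurs. After mapping features to column indices, the resulting vectors could be passed to any SVM implementation for training.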

Keywords

XML DOCUMENT; XML LANGUAGE; CLASSIFICATION; SUPPORT VECTOR MACHINE

Domains

Programming Languages [cs.PL]

Main file

CF2010-PUB00029029.pdf (35.75 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-00585914

Submitted on: Thursday, 14 April 2011, 11:00:46

Last modified on: Tuesday, 12 March 2024, 10:46:25

Long-term archiving on: Friday, 15 July 2011, 02:40:06

Dates and versions

hal-00585914 , version 1 (14-04-2011)

Identifiers

HAL Id : hal-00585914 , version 1
IRSTEA : PUB00029029

Cite

Samaneh Chagheri, Catherine Roussey, Sylvie Calabretto, Cyril Dumoulin. XML Document Classification using SVM. SFC'2010 (Société Francophone de Classification), Jun 2010, Saint Denis de la Réunion, France. pp.71-74. ⟨hal-00585914⟩

Collections

CNRS UNIV-LYON1 UNIV-LYON2 INSA-LYON EC-LYON IRSTEA LIRIS LABEXIMU INSA-GROUPE UDL INRAE TSCF MATHNUM

278 views

184 downloads
