XML Document Classification using SVM

Abstract : This paper describes a representation of XML documents for classification. Document classification depends on document representation techniques: the more relevant the representation, the more relevant the classification. We propose a representation model that exploits both the structure and the content of a document. Our approach is based on the vector space model: a document is represented by a vector of weighted features, where each feature is a (tag, term) pair. We extend tf*idf to compute a feature's weight according to the term's structural level in the document. SVM is used as the learning algorithm. Experiments on the Reuters collection show that our proposal improves classification performance compared to the standard classification model based on term vectors.
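
The representation summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula, so everything specific here is an assumption made for illustration: the function names `weighted_tf` and `tfidf_vectors`, the choice to scale each occurrence by 1/depth (so terms closer to the root count more), and the plain log(N/df) form of idf. It is not the authors' implementation.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def weighted_tf(xml_text):
    """Map each (tag, term) pair to a depth-scaled term frequency.

    Assumption: an occurrence at structural level `depth` (root = 1)
    contributes 1/depth. The paper's real weighting is not stated in
    the abstract. Tail text after child elements is ignored for brevity.
    """
    root = ET.fromstring(xml_text)
    tf = Counter()
    stack = [(root, 1)]
    while stack:
        elem, depth = stack.pop()
        for term in (elem.text or "").lower().split():
            tf[(elem.tag, term)] += 1.0 / depth
        stack.extend((child, depth + 1) for child in elem)
    return tf


def tfidf_vectors(docs):
    """Return one sparse vector (dict) per document, keyed by (tag, term)."""
    tfs = [weighted_tf(d) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each feature counted once per document
    n = len(docs)
    return [{f: w * math.log(n / df[f]) for f, w in tf.items()} for tf in tfs]


# Hypothetical two-document collection, for illustration only.
docs = [
    "<doc><title>wheat prices</title><body><p>wheat exports rise</p></body></doc>",
    "<doc><title>oil prices</title><body><p>oil exports fall</p></body></doc>",
]
vectors = tfidf_vectors(docs)
```

In this sketch the same term yields distinct features depending on its enclosing tag, so ("title", "wheat") and ("p", "wheat") are weighted separately. The resulting sparse vectors could then be converted to a matrix and passed to any SVM implementation for training.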
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-00585914
Contributor : Import Ws Irstea
Submitted on : Thursday, April 14, 2011 - 11:00:46 AM
Last modification on : Thursday, February 7, 2019 - 2:33:05 PM
Long-term archiving on : Friday, July 15, 2011 - 2:40:06 AM

File

CF2010-PUB00029029.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00585914, version 1
  • IRSTEA : PUB00029029

Citation

Samaneh Chagheri, Catherine Roussey, Sylvie Calabretto, Cyril Dumoulin. XML Document Classification using SVM. SFC'2010 (Société Francophone de Classification), Jun 2010, Saint Denis de la Réunion, France. pp.71-74. ⟨hal-00585914⟩
