Document classification: Combining structure and content

Abstract : Technical documentation, such as user manuals and manufacturing documents, is now an important part of industrial production. Indeed, given the complexity of the products, they can be neither manufactured nor used without such documents. The growing volume of these documents stored in electronic format therefore calls for an automatic classification system that assigns them to predefined classes and allows information to be retrieved quickly. These documents are also strongly structured and contain elements such as tables and schemas. However, traditional document classification typically considers only the document text and ignores its structural elements. In this paper, we propose a method that uses structural elements to build the document feature vector for classification. Each feature in this vector is a combination of a term and the structure in which it occurs. The document structure is represented by the tags of the XML document. The SVM algorithm is used for both learning and classification.
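The idea of a feature that combines a term with its structure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `structure_term_features`, the `tag:term` key format, the simple tokenizer, and the use of raw counts as feature values are all assumptions made for illustration, since the abstract does not specify the weighting scheme.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical tokenizer: lowercase alphanumeric runs (an assumption).
TOKEN = re.compile(r"[a-z0-9]+")


def _walk(elem, counts):
    """Pair every term with the tag of the element that directly contains it."""
    # Text directly inside this element, before its first child.
    texts = [elem.text or ""]
    for child in elem:
        _walk(child, counts)
        # Text that follows a child still belongs to the current element.
        texts.append(child.tail or "")
    for term in TOKEN.findall(" ".join(texts).lower()):
        counts[f"{elem.tag}:{term}"] += 1


def structure_term_features(xml_text):
    """Return a sparse feature vector keyed by 'tag:term' (illustrative only)."""
    counts = Counter()
    _walk(ET.fromstring(xml_text), counts)
    return counts


# Made-up example document, not taken from the paper's corpus.
doc = (
    "<manual><title>Pump installation</title>"
    "<table><cell>pump pressure</cell></table></manual>"
)
features = structure_term_features(doc)
for name, value in sorted(features.items()):
    print(name, value)
```

In this sketch the word "pump" yields two distinct features, `title:pump` and `cell:pump`, so a classifier can weight the same term differently depending on where it appears. Such sparse vectors could then be passed to any SVM implementation for training and classification.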

https://hal.archives-ouvertes.fr/hal-00637665
Contributor : Import Ws Irstea
Submitted on : Wednesday, November 2, 2011 - 4:01:08 PM
Last modification on : Thursday, February 7, 2019 - 2:48:57 PM
Long-term archiving on : Friday, February 3, 2012 - 2:30:34 AM

File

CF2011-PUB00032426.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00637665, version 1
  • IRSTEA : PUB00032426

Citation

Samaneh Chagheri, Sylvie Calabretto, Catherine Roussey, Cyril Dumoulin. Document classification: Combining structure and content. 13th International Conference on Enterprise Information Systems (ICEIS), Jun 2011, Beijing, China. ⟨hal-00637665⟩
