Bayesian Network Model for Semi-Structured Document Classification

Ludovic Denoyer (1), Patrick Gallinari (1)
(1) MALIRE - Machine Learning and Information Retrieval
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : Recently, a new community has started to emerge around the development of new information retrieval methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of content (text, images, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that handles both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here, text and images). The model was tested on three databases: the classical WebKB corpus composed of HTML pages, the new INEX corpus, which has become a reference in the field of ad hoc retrieval for XML documents, and a multimedia corpus of Web pages.
Document type :
Journal articles
Contributor : Lip6 Publications
Submitted on : Tuesday, July 7, 2015 - 10:48:41 AM
Last modification on : Friday, May 24, 2019 - 5:23:50 PM



Ludovic Denoyer, Patrick Gallinari. Bayesian Network Model for Semi-Structured Document Classification. Information Processing and Management, Elsevier, 2004, 40 (5), pp.807-827. ⟨10.1016/j.ipm.2004.04.009⟩. ⟨hal-01172241⟩


