Automatic Web Pages Author Extraction - HAL open archive
Conference paper, Year: 2009

Automatic Web Pages Author Extraction

Abstract

This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach, using a C4.5 decision tree algorithm. The particularity of our approach is that it uses both structural and contextual information. A semi-automatic approach was used for corpus expansion, so that the dataset could be annotated with less human effort. This paper shows that our method achieves good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.
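As a rough illustration of the two technical ingredients named in the abstract, the sketch below computes the gain ratio (the split criterion C4.5 uses to choose attributes) and the F1-measure on a toy set of candidate text nodes. The feature names `in_meta_tag` (structural) and `preceded_by_by` (contextual), the `is_author` label, and the data are all hypothetical; the abstract does not list the paper's actual features, so this is not the authors' implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, feature, target="is_author"):
    """Gain ratio of splitting `rows` on `feature` (C4.5's split criterion)."""
    labels = [row[target] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    # Split information penalises attributes with many small partitions.
    split_info = entropy([row[feature] for row in rows])
    if split_info == 0:
        return 0.0
    return (entropy(labels) - remainder) / split_info


def f1_measure(predicted, actual):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical candidate nodes taken from HTML pages (toy data, not from the paper).
rows = [
    {"in_meta_tag": True,  "preceded_by_by": False, "is_author": True},
    {"in_meta_tag": False, "preceded_by_by": True,  "is_author": True},
    {"in_meta_tag": False, "preceded_by_by": True,  "is_author": True},
    {"in_meta_tag": False, "preceded_by_by": False, "is_author": False},
    {"in_meta_tag": False, "preceded_by_by": False, "is_author": False},
    {"in_meta_tag": False, "preceded_by_by": True,  "is_author": False},
]

for feature in ("in_meta_tag", "preceded_by_by"):
    print(feature, round(gain_ratio(rows, feature), 3))

# Hypothetical predictions versus gold labels for four nodes.
print(f1_measure([True, True, False, True], [True, False, False, True]))
```

On this toy data the structural feature obtains the higher gain ratio, so a C4.5-style learner would test it first; with other data the contextual feature could just as well rank higher.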

Dates and versions

hal-00577127, version 1 (16-03-2011)

Identifiers

Cite

Sahar Changuel, Nicolas Labroche, Bernadette Bouchon-Meunier. Automatic Web Pages Author Extraction. FQAS 2009 - 8th International Conference on Flexible Query Answering Systems, Oct 2009, Roskilde, Denmark. pp.300-311, ⟨10.1007/978-3-642-04957-6_26⟩. ⟨hal-00577127⟩