Automatic Web Pages Author Extraction

Sahar Changuel; Nicolas Labroche; Bernadette Bouchon-Meunier

doi:10.1007/978-3-642-04957-6_26

Conference paper, Year: 2009


Sahar Changuel

Role: Author
PersonId: 895674

Machine Learning and Information Retrieval

Nicolas Labroche

Role: Author
PersonId: 4509
IdHAL: nicolas-labroche
ORCID: 0000-0002-2794-2124
IdRef: 132080303

Machine Learning and Information Retrieval

Bernadette Bouchon-Meunier

Role: Author
PersonId: 9708
IdHAL: bernadette-bouchon-meunier
ORCID: 0000-0002-7937-7796
IdRef: 031064442

Machine Learning and Information Retrieval

Abstract

This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach based on the C4.5 decision tree algorithm. The particularity of our approach is that it draws on both structural and contextual information. We expanded the corpus semi-automatically so that the dataset could be annotated with less human effort. The results show that our method achieves an F1-measure of more than 80% despite the heterogeneity of the corpus.
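For readers unfamiliar with the metric quoted above, the F1-measure is the harmonic mean of precision and recall. The following is a minimal sketch in Python, not the authors' evaluation code: the function name `precision_recall_f1`, the choice of (page, author) pairs as the unit of evaluation, and the toy data are all assumptions made for illustration, since the abstract does not say how matches were counted.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) over two collections of items.

    Assumption: each item is a (page identifier, author name) pair and a
    prediction counts as correct only if it matches a gold pair exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not taken from the paper's corpus.
gold = {("page1", "A. Writer"), ("page2", "B. Author"), ("page3", "C. Scribe")}
predicted = {("page1", "A. Writer"), ("page2", "B. Author"), ("page3", "Webmaster")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case two of the three predictions are correct, so precision, recall and F1 all equal 2/3. Under the same counting convention, the result reported in the abstract corresponds to an F1 value above 0.8.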

Domains

Artificial intelligence [cs.AI]


https://hal.science/hal-00577127

Submitted on: Wednesday, 16 March 2011, 13:30:12

Last modified on: Tuesday, 11 April 2023, 15:16:28

Dates and versions

hal-00577127 , version 1 (16-03-2011)

Identifiers

HAL Id: hal-00577127, version 1
DOI: 10.1007/978-3-642-04957-6_26

Cite

Sahar Changuel, Nicolas Labroche, Bernadette Bouchon-Meunier. Automatic Web Pages Author Extraction. FQAS 2009 - 8th International Conference on Flexible Query Answering Systems, Oct 2009, Roskilde, Denmark. pp.300-311, ⟨10.1007/978-3-642-04957-6_26⟩. ⟨hal-00577127⟩


Collections

UPMC CNRS LIP6 SORBONNE-UNIVERSITE SU-SCIENCES
