FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

Marilena Oita; Pierre Senellart

Preprint, Working Paper. Year: 2012



Marilena Oita

Role: Author
PersonId : 882947

Télécom ParisTech

Pierre Senellart

Role: Author
PersonId : 11778
IdHAL : pierre-senellart
ORCID : 0000-0002-7909-5369
IdRef : 124713769

Télécom ParisTech

Département Informatique et Réseaux

Abstract

Data-intensive Web sites, e.g., blogs or news sites, present pages containing Web articles (a blog post, a news item, etc.). These Web articles, typically generated automatically by a content management system, combine a fixed template with variable content. Unsupervised extraction of their content (excluding the boilerplate of Web pages, i.e., their common template) is of interest in many applications, such as indexing or archiving. We present a novel approach for the extraction of Web articles from dynamic Web pages. Our algorithm, Forest, uses automatically acquired keywords for a Web page to target the zone of the page relevant to those keywords, and derives structural patterns that identify the content of interest. We consider two potential sources of keywords: Web feeds that may link to the Web page, and terms found through a frequency analysis on the Web page itself. These structural patterns are aggregated across different Web pages that use the same layout, and ranked using a new measure of relevance with respect to the set of keywords. We extensively evaluate Forest and report improved results over the state of the art in Web article content extraction.
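The general idea in the abstract (locate keyword-bearing zones, describe them by tag paths, aggregate over pages sharing a layout, then rank) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TagPathCollector`, `keyword_hits_by_path`, and `rank_tag_paths` are hypothetical, tag paths here are plain tag names without attributes, and the score is a simple keyword-hit count standing in for the paper's relevance measure, which the abstract does not define. Keywords are assumed to be given (for example, taken from a feed item).

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import re

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagPathCollector(HTMLParser):
    """Collects the text found directly under each tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.text_by_path["/".join(self.stack)].append(data.strip())


def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_hits_by_path(html, keywords):
    """Count keyword occurrences under each tag path of one page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    hits = {}
    for path, chunks in collector.text_by_path.items():
        tokens = tokenize(" ".join(chunks))
        hits[path] = sum(1 for tok in tokens if tok in keywords)
    return hits


def rank_tag_paths(pages, keywords):
    """Aggregate hits over pages with the same layout and rank the paths.

    Assumption: summed keyword hits are a placeholder score, not the
    relevance measure proposed in the paper.
    """
    totals = Counter()
    for html in pages:
        for path, hits in keyword_hits_by_path(html, keywords).items():
            totals[path] += hits
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Given several pages from the same site and a keyword set, `rank_tag_paths` returns `(path, score)` pairs; under these assumptions the top-ranked path would point at the article body rather than at navigation or other template text.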

Keywords

content extraction; keyword; structural similarity; DOM; ranking

Domains

Information retrieval [cs.IR]

Main file

richSemanticPaths.pdf (466.94 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-00747816

Submitted on: Friday, November 2, 2012, 10:55:47

Last modified on: Monday, October 9, 2023, 12:49:39

Long-term archiving on: Sunday, February 3, 2013, 03:36:08

Dates and versions

hal-00747816 , version 1 (02-11-2012)

Identifiers

HAL Id : hal-00747816 , version 1

Cite

Marilena Oita, Pierre Senellart. FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths. 2012. ⟨hal-00747816⟩


Collections

INSTITUT-TELECOM PARISTECH INFRES

