FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths - HAL open archive
Preprint, Working Paper. Year: 2012

FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

Marilena Oita
  • Role: Author
  • PersonId : 882947

Abstract

Data-intensive Web sites, e.g., blogs or news sites, present pages containing Web articles (a blog post, a news item, etc.). These Web articles, typically generated automatically by a content management system, combine a fixed template with variable content. Unsupervised extraction of their content (excluding the boilerplate of Web pages, i.e., their common template) is of interest in many applications, such as indexing or archiving. We present a novel approach for the extraction of Web articles from dynamic Web pages. Our algorithm, Forest, targets the zone of a Web page that is relevant to a set of automatically acquired keywords for that page, in order to obtain structural patterns identifying the content of interest. We consider two potential sources of keywords: Web feeds that may link to the Web page, and terms found through a frequency analysis of the Web page itself. These structural patterns are aggregated across different Web pages that use the same layout, and ranked using a new measure of relevance with respect to the set of keywords. We extensively evaluate Forest and report improved results over the state of the art in Web article content extraction.
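The pipeline outlined in the abstract (keywords per page, tag paths as structural patterns, aggregation across pages sharing a layout, ranking by relevance) can be illustrated with a minimal sketch. This is not the Forest algorithm itself: the abstract does not define its relevance measure, so the sketch substitutes a plain keyword count as a stand-in. The names `TagPathText`, `score_paths`, and `best_path`, the class-qualified path notation (`div.post`), and the two sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagPathText(HTMLParser):
    """Collect the words found directly under each tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, step) pairs for the currently open elements
        self.words_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        # Assumption: qualify each step with its class to tell zones apart.
        self.stack.append((tag, f"{tag}.{cls}" if cls else tag))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self.stack:
            path = "/".join(step for _, step in self.stack)
            self.words_by_path[path].extend(re.findall(r"\w+", data.lower()))


def score_paths(html, keywords):
    """Stand-in relevance: number of keyword occurrences under each path."""
    parser = TagPathText()
    parser.feed(html)
    parser.close()
    wanted = {k.lower() for k in keywords}
    return {
        path: sum(word in wanted for word in words)
        for path, words in parser.words_by_path.items()
    }


def best_path(pages):
    """Sum per-page scores over pages with the same layout; return the top path."""
    total = Counter()
    for html, keywords in pages:
        total.update(score_paths(html, keywords))
    return total.most_common(1)[0][0]


# Two hypothetical pages sharing one template, each with its own keywords
# (as if taken from a Web feed item pointing to the page).
PAGE_A = """<html><body>
<div class="nav"><a>Home</a><a>Storm news</a></div>
<div class="post"><p>A storm hit the coast. The storm flooded roads.</p></div>
<div class="footer">Contact</div>
</body></html>"""

PAGE_B = """<html><body>
<div class="nav"><a>Home</a><a>Election news</a></div>
<div class="post"><p>The election was close. Every vote counted, and each vote mattered.</p></div>
<div class="footer">Contact</div>
</body></html>"""

PAGES = [(PAGE_A, ["storm", "coast"]), (PAGE_B, ["election", "vote"])]

if __name__ == "__main__":
    print(best_path(PAGES))
```

Aggregating over several pages is what separates the article zone from the template in this sketch: the navigation bar mentions a keyword once per page, while the path holding the article body accumulates hits on every page that uses the layout.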
Main file
richSemanticPaths.pdf (466.94 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00747816, version 1 (02-11-2012)

Identifiers

  • HAL Id: hal-00747816, version 1

Cite

Marilena Oita, Pierre Senellart. FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths. 2012. ⟨hal-00747816⟩
241 views
286 downloads
