FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths
Abstract
Data-intensive Web sites, e.g., blogs or news sites, present pages containing Web articles (a blog post, a news item, etc.). These Web articles, typically generated automatically by a content management system, combine a fixed template with variable content. Unsupervised extraction of their content (excluding the boilerplate of Web pages, i.e., their common template) is of interest in many applications, such as indexing or archiving. We present a novel approach for the extraction of Web articles from dynamic Web pages. Our algorithm, Forest, targets the zone of a Web page that is relevant to a set of automatically acquired keywords, and derives from it structural patterns identifying the content of interest. We consider two potential sources of keywords: Web feeds that may link to the Web page, and terms found through a frequency analysis of the Web page itself. These structural patterns are aggregated across different Web pages that use the same layout, and ranked using a new measure of relevance with respect to the set of keywords. We extensively evaluate Forest and report improved results over the state of the art in Web article content extraction.
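The pipeline outlined in the abstract (acquire keywords, locate the tag paths whose text contains them, aggregate across pages sharing a layout, rank) can be illustrated with a minimal sketch. This is not the Forest algorithm itself: the abstract does not define its relevance measure, so the sketch substitutes a raw count of keyword hits per tag path. It covers only the frequency-analysis keyword source, and all names (`TagPathCollector`, `frequent_terms`, `rank_tag_paths`, the sample pages) are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagPathCollector(HTMLParser):
    """Record a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, root first
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def frequent_terms(html, k=2, min_len=4):
    """Assumed keyword source: the k most frequent terms of the page itself."""
    counts = Counter(
        w for _, text in text_nodes(html) for w in words(text) if len(w) >= min_len
    )
    return {w for w, _ in counts.most_common(k)}


def rank_tag_paths(pages, keywords_per_page):
    """Aggregate keyword hits per tag path over pages sharing one layout.

    The score is a plain hit count, a stand-in for the paper's relevance measure.
    """
    scores = Counter()
    for html, keywords in zip(pages, keywords_per_page):
        for path, text in text_nodes(html):
            scores[path] += sum(1 for w in words(text) if w in keywords)
    return scores.most_common()


# Two toy pages with the same template (a navigation list) and different articles.
TEMPLATE = "<html><body><div><ul><li>Home</li><li>About</li></ul></div><div><p>{}</p></div></body></html>"
PAGE_A = TEMPLATE.format("Forest fires spread quickly; forest crews respond to fires.")
PAGE_B = TEMPLATE.format("Market prices rose as market traders watched prices.")

pages = [PAGE_A, PAGE_B]
ranking = rank_tag_paths(pages, [frequent_terms(p) for p in pages])
print(ranking[0][0])  # top-ranked path: html/body/div/p
```

On these toy pages the navigation path `html/body/div/ul/li` receives no keyword hits, so the article path ranks first. A plain tag path cannot tell sibling elements of the same type apart, which is one reason a real system needs richer structural patterns than this sketch uses.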
Domains
Information Retrieval [cs.IR]
Origin: Files produced by the author(s)