Conference paper. Year: 2010

Archiving Data Objects using Web Feeds

Abstract

Web feeds, in either of the XML-based RSS or Atom formats, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. In this paper, we show how Web feeds can be useful instruments for information extraction and Web page change detection. Web pages referenced by feed items are usually blog posts or news articles: data that is dynamic (and hence ephemeral) and clustered topically in a feed channel. We monitor Web channels and extract from the associated Web pages the text and references corresponding to Web articles. The result is enriched with the timestamp and additional metadata mined from the feed, and encapsulated in a 'data object'. In particular, the data object is devoid of template elements and advertisements. These irrelevant elements, generically called boilerplate, not only consume time and space from the crawler's point of view, but also hinder the data analysis process. We first gather statistics on a set of Web feeds by crawling them for a period of time and observing their temporal aspects. We then present the algorithm used for article extraction, which uses the feed semantics (more specifically, the description and title of feed items) to identify the DOM node in the HTML page that contains the article. The data objects constructed in this way can be used as a semantic overlay collection for an archive, or in the context of an incremental crawl, which becomes more efficient by detecting change at the data object level. Experiments on the extraction technique validate our approach, with good results even in cases where other techniques fail. Finally, we discuss useful applications based on the extraction and change detection of Web objects.
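
To make the pipeline described in the abstract more concrete, here is a minimal Python sketch. It is not the authors' algorithm. It substitutes a simple heuristic, namely picking the smallest DOM element whose text covers most of the words in a feed item's title and description, and it detects change by hashing the extracted text. All names (`feed_items`, `extract_article`, `data_object`, `has_changed`), the `min_cover` threshold, and the RSS 2.0 input format are assumptions made for illustration only.

```python
import hashlib
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    """A very small DOM node: parts holds child Nodes and text strings in order."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.parts = []

    def text(self):
        return " ".join(p if isinstance(p, str) else p.text() for p in self.parts)

    def descendants(self):
        for p in self.parts:
            if isinstance(p, Node):
                yield p
                yield from p.descendants()

class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; tolerant of unclosed tags."""
    def __init__(self):
        super().__init__()
        self.root = self.current = Node("#root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.parts.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip() and self.current.tag not in ("script", "style"):
            self.current.parts.append(data.strip())

def words(s):
    return set(re.findall(r"\w+", s.lower()))

def feed_items(rss_xml):
    """Yield title, description, link and date for each RSS 2.0 item."""
    for item in ET.fromstring(rss_xml).iter("item"):
        yield {k: item.findtext(k, "") for k in ("title", "description", "link", "pubDate")}

def extract_article(html, item, min_cover=0.8):
    """Assumed heuristic: smallest element covering min_cover of the feed clue words."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    clues = words(item["title"]) | words(re.sub(r"<[^>]+>", " ", item["description"]))
    best = None
    for node in builder.root.descendants():
        text = node.text()
        if clues and len(clues & words(text)) >= min_cover * len(clues):
            # "<=" prefers the deeper node when two candidates have equal size.
            if best is None or len(text) <= len(best.text()):
                best = node
    return best

def data_object(html, item):
    """Bundle the extracted text and links with metadata taken from the feed."""
    node = extract_article(html, item)
    text = node.text() if node else ""
    refs = [d.attrs["href"] for d in node.descendants()
            if d.tag == "a" and "href" in d.attrs] if node else []
    return {
        "link": item["link"],
        "timestamp": item["pubDate"],
        "title": item["title"],
        "text": text,
        "references": refs,
        "digest": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }

def has_changed(old, new):
    """Change detection at data object level: compare digests of the article text."""
    return old["digest"] != new["digest"]
```

Because the digest is computed over the extracted article text only, a re-crawl in which only the surrounding template or advertisements differ would not be reported as a change under this sketch, while an edit to the article body would.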

Domains

Web
Main file
iwawienna.pdf (522.46 KB)
Origin: Files produced by the author(s)

Dates and versions

inria-00537962, version 1 (19-11-2010)

Identifiers

  • HAL Id: inria-00537962, version 1

Cite

Marilena Oita, Pierre Senellart. Archiving Data Objects using Web Feeds. International Workshop on Web Archiving, Sep 2010, Vienna, Austria. ⟨inria-00537962⟩
479 views
243 downloads
