Archiving Data Objects using Web Feeds

Marilena Oita; Pierre Senellart

Conference paper, Year: 2010

Marilena Oita

Role: Corresponding author
PersonId : 882947

Inria Saclay - Ile de France

Télécom ParisTech

Pierre Senellart

Role: Author
PersonId : 11778
IdHAL : pierre-senellart
ORCID : 0000-0002-7909-5369
IdRef : 124713769

Télécom ParisTech

Abstract

Web feeds, in either of the XML-based RSS or Atom formats, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. In this paper, we show how Web feeds can be useful instruments for information extraction and Web page change detection. The Web pages referenced by feed items are usually blog posts or news articles: data that is dynamic (and therefore ephemeral) and that is clustered topically in a feed channel. We monitor Web channels and extract from the associated Web pages the text and references corresponding to Web articles. The result is enriched with the timestamp and additional metadata mined from the feed, and encapsulated in a 'data object'. In particular, the data object is information devoid of all template elements and advertisements. These irrelevant elements, generically called boilerplate, not only consume time and space from the crawler's point of view, but also hinder the data analysis process. We first compute some statistics on a set of Web feeds, by crawling them for a period of time and observing their temporal aspects. We then present the algorithm used for article extraction, which uses the feed semantics (more specifically, the description and title of feed items) to identify the DOM node in the HTML page that contains the article. The data objects constructed in this way can be used as a semantic overlay collection for an archive, or in the context of an incremental crawl, which they make more efficient by detecting change at the data object level. Experiments on the extraction technique validate our approach, with good results even in cases where other techniques fail. Finally, we discuss useful applications based on the extraction and change detection of Web objects.
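
To make the idea in the abstract concrete, here is a minimal sketch of feed-guided extraction. It is not the authors' algorithm, whose details are in the paper itself. It only illustrates the general principle, assuming that a crude keyword overlap between the feed item (title and description) and each DOM node's text is enough to locate the article. All names (`Node`, `TreeBuilder`, `extract_article`, `build_data_object`, the `min_len` parameter, and the keys of the `item` dictionary) are hypothetical, and the SHA-256 fingerprint is just one assumed way of detecting change at the data object level.

```python
import hashlib
import re
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}  # text inside these elements is never article text


class Node:
    """A DOM-like node; `items` mixes text strings and child nodes in order."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.items = tag, parent, []
        if parent is not None:
            parent.items.append(self)

    def full_text(self):
        return " ".join(i if isinstance(i, str) else i.full_text()
                        for i in self.items)


class TreeBuilder(HTMLParser):
    """Builds a rough tree from HTML, tolerating stray or missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID:  # void elements have no content to descend into
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP and data.strip():
            self.current.items.append(data.strip())


def words(text):
    return re.findall(r"\w+", text.lower())


def extract_article(html, title, description, min_len=4):
    """Return the smallest node covering the most feed keywords (assumed heuristic)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Dropping short words is a crude stand-in for stop-word removal.
    keywords = {w for w in words(title + " " + description) if len(w) >= min_len}
    best, best_key = None, None
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(i for i in node.items if isinstance(i, Node))
        tokens = words(node.full_text())
        if not tokens:
            continue
        # Prefer higher keyword coverage, then less surrounding text.
        key = (len(keywords & set(tokens)), -len(tokens))
        if best_key is None or key > best_key:
            best, best_key = node, key
    return best


def build_data_object(html, item):
    """Wrap the extracted text with feed metadata and a content fingerprint."""
    node = extract_article(html, item["title"], item["description"])
    text = " ".join(node.full_text().split()) if node else ""
    return {
        "title": item["title"],
        "timestamp": item.get("published"),
        "text": text,
        "fingerprint": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
```

Under these assumptions, two crawls of the same page yield the same fingerprint as long as only the boilerplate (navigation, advertisements) differs, so an incremental crawler could compare fingerprints rather than whole pages. A real implementation would need a more robust similarity measure and tie-breaking rule than this keyword count.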

Keywords

Web archiving; data object; Web feed; Web page dynamics

Domains

Web

Main file

iwawienna.pdf (522.46 KB)

Origin: Files produced by the author(s)

Contributor: Marilena Oita

https://inria.hal.science/inria-00537962

Submitted on: Friday, November 19, 2010, 17:35:25

Last modified on: Monday, October 9, 2023, 12:49:42

Long-term archiving on: Friday, October 26, 2012, 16:10:42

Dates and versions

inria-00537962 , version 1 (19-11-2010)

Identifiers

HAL Id : inria-00537962 , version 1

Cite

Marilena Oita, Pierre Senellart. Archiving Data Objects using Web Feeds. International Workshop on Web Archiving, Sep 2010, Vienna, Austria. ⟨inria-00537962⟩

Collections

INSTITUT-TELECOM INRIA PARISTECH INRIA2

479 views

243 downloads
