Coherence-oriented Crawling and Navigation for Web Archives using Patterns - HAL open archive
Conference paper. Year: 2011

Coherence-oriented Crawling and Navigation for Web Archives using Patterns

Myriam Ben Saad
  • Role: Author
  • PersonId : 968945
Zeynep Pehlivan
  • Role: Author
  • PersonId : 971792
Stéphane Gançarski

Abstract

In this paper, we address the issue of improving the coherence of web archives under limited resources (e.g., bandwidth, storage space). Coherence measures how well a collection of archived page versions reflects the real state (or snapshot) of a set of related web pages at different points in time. An ideal way to preserve the coherence of archives would be to prevent page content from changing during the crawl of a complete collection. However, this is infeasible in practice because web sites are autonomous and dynamic. We propose two solutions: one a priori and one a posteriori. As an a priori solution, our idea is to crawl sites during off-peak hours (i.e., the periods of time when very few changes are expected on the pages), based on patterns. A pattern models how the importance of page changes behaves over a period of time. As an a posteriori solution, based on the same patterns, we introduce a novel navigation approach that enables users to browse the most coherent page versions at a given query time.
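The a priori idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes, purely for illustration, that a pattern is a list of per-slot change-importance values (here 24 hourly slots with made-up numbers) and that the crawler wants the contiguous window with the lowest total expected change. The function name `off_peak_window` and the `width` parameter are hypothetical.

```python
from typing import Sequence


def off_peak_window(pattern: Sequence[float], width: int = 1) -> int:
    """Return the start slot of the `width`-slot window with the lowest
    summed change importance. The pattern is treated as cyclic, so a
    window may wrap from the last slot back to the first.

    Assumption: `pattern[i]` is the expected importance of changes in
    slot i; the abstract does not specify this representation.
    """
    n = len(pattern)
    if n == 0:
        raise ValueError("pattern must not be empty")
    if not 1 <= width <= n:
        raise ValueError("width must be between 1 and len(pattern)")

    best_start, best_cost = 0, float("inf")
    for start in range(n):
        # Total expected change importance over the candidate window.
        cost = sum(pattern[(start + k) % n] for k in range(width))
        if cost < best_cost:  # strict comparison keeps the earliest tie
            best_start, best_cost = start, cost
    return best_start


# Hypothetical hourly pattern for one site (illustrative values only).
hourly_pattern = [
    0.10, 0.05, 0.02, 0.02, 0.03, 0.10, 0.30, 0.60,
    0.90, 1.00, 0.80, 0.70, 0.90, 0.80, 0.60, 0.50,
    0.60, 0.70, 0.80, 0.60, 0.40, 0.30, 0.20, 0.15,
]

if __name__ == "__main__":
    # Start hour of the quietest 3-hour crawl window for this pattern.
    print(off_peak_window(hourly_pattern, width=3))
```

Under these assumed values, a three-hour crawl would be scheduled to start at slot 2, where the summed expected change is smallest. A real scheduler would also have to account for crawl duration per page and politeness constraints, which this sketch ignores.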

Dates and versions

hal-01286251 , version 1 (10-03-2016)

Cite

Myriam Ben Saad, Zeynep Pehlivan, Stéphane Gançarski. Coherence-oriented Crawling and Navigation for Web Archives using Patterns. International Conference on Theory and Practice of Digital Libraries, TPDL 2011, Sep 2011, Berlin, Germany. pp.421-433, ⟨10.1007/978-3-642-24469-8_42⟩. ⟨hal-01286251⟩