Coherence-oriented Crawling and Navigation for Web Archives using Patterns

Myriam Ben Saad (1), Zeynep Pehlivan (1), Stéphane Gançarski (1)
(1) BD - Bases de Données
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : This paper addresses the problem of improving the coherence of web archives under limited resources (e.g. bandwidth, storage space). Coherence measures how well a collection of archived page versions reflects the real state (or snapshot) of a set of related web pages at different points in time. An ideal way to preserve the coherence of archives is to prevent page content from changing during the crawl of a complete collection. However, this is infeasible in practice because web sites are autonomous and dynamic. We propose two solutions: a priori and a posteriori. As an a priori solution, our idea is to crawl sites during off-peak hours (i.e. the periods of time when very few changes are expected on the pages) based on patterns. A pattern models the behavior of the importance of page changes during a period of time. As an a posteriori solution, based on the same patterns, we introduce a novel navigation approach that enables users to browse the most coherent page versions at a given query time.
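The a priori idea in the abstract (crawl each page when its pattern predicts the fewest important changes) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes, purely for illustration, that a pattern is a list of 24 hourly change-importance values, and the names `off_peak_hours` and `schedule_crawls` are hypothetical.

```python
from typing import Dict, List

# Assumption for this sketch: a pattern holds one change-importance
# value per hour of the day. The paper may define patterns differently.
HOURS_PER_DAY = 24


def off_peak_hours(pattern: List[float], k: int = 1) -> List[int]:
    """Return the k hours with the lowest expected change importance.

    Ties are broken in favour of the earlier hour.
    """
    if len(pattern) != HOURS_PER_DAY:
        raise ValueError("expected one value per hour (24 values)")
    if not 1 <= k <= HOURS_PER_DAY:
        raise ValueError("k must be between 1 and 24")
    # Rank hours by (importance, hour) so the ordering is deterministic.
    ranked = sorted(range(HOURS_PER_DAY), key=lambda h: (pattern[h], h))
    return ranked[:k]


def schedule_crawls(patterns: Dict[str, List[float]]) -> Dict[str, int]:
    """Map each page to its single quietest hour (hypothetical helper)."""
    return {page: off_peak_hours(p)[0] for page, p in patterns.items()}
```

Under these assumptions, a crawler with limited bandwidth would visit each page at the hour returned by `schedule_crawls`, which reduces the chance that the page changes while the rest of the collection is still being fetched.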
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01286251
Contributor : Lip6 Publications
Submitted on : Thursday, March 10, 2016 - 3:09:51 PM
Last modification on : Thursday, March 21, 2019 - 2:43:17 PM

Citation

Myriam Ben Saad, Zeynep Pehlivan, Stéphane Gançarski. Coherence-oriented Crawling and Navigation for Web Archives using Patterns. International Conference on Theory and Practice of Digital Libraries, TPDL 2011, Sep 2011, Berlin, Germany. pp.421-433, ⟨10.1007/978-3-642-24469-8_42⟩. ⟨hal-01286251⟩
