Conference paper, Year: 2010

Using visual pages analysis for optimizing web archiving

Myriam Ben Saad
  • Role: Author
  • PersonId: 968945
Stéphane Gançarski

Abstract

Due to the growing importance of the World Wide Web, archiving it has become crucial for preserving useful sources of information. To keep a web archive up to date, crawlers harvest the web by iteratively downloading new versions of documents. However, crawlers frequently retrieve pages whose only changes are unimportant ones, such as continually updated advertisements. As a result, web archive systems waste time and space indexing and storing useless page versions, and querying the archive takes longer because of the large set of useless versions stored. An effective method is therefore required to determine accurately when and how often important changes between versions occur, so that web pages can be archived efficiently. Our work addresses this requirement through a new web archiving approach that detects important changes between page versions. The approach consists of archiving the visual layout structure of a web page, represented by semantic blocks. This paper describes the proposed approach and examines related issues, such as using the importance of changes between versions to optimize web crawl scheduling. It also introduces the major research questions that we would like to address in the future.
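As a rough illustration of the idea in the abstract, the sketch below scores the change between two versions of a page by the importance of the semantic blocks that differ, so that a change confined to an advertisement counts for little. It is not the authors' algorithm: the `Block` structure, the importance weights, the scoring formula, and the 0.2 threshold are all assumptions made for this example, and block segmentation of the rendered page is taken as already done.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One semantic block of a rendered page (hypothetical structure)."""
    text: str
    importance: float  # assumed weight in [0, 1]; an ad block would be near 0


def change_importance(old: dict[str, Block], new: dict[str, Block]) -> float:
    """Return the importance-weighted share of blocks that differ between versions."""
    total = 0.0
    changed = 0.0
    for block_id in old.keys() | new.keys():
        before = old.get(block_id)
        after = new.get(block_id)
        # Take the weight from the newer version when the block still exists.
        weight = after.importance if after is not None else before.importance
        total += weight
        # A block counts as changed if it was inserted, deleted, or modified.
        if before is None or after is None or before.text != after.text:
            changed += weight
    return changed / total if total else 0.0


def should_archive(old: dict[str, Block], new: dict[str, Block],
                   threshold: float = 0.2) -> bool:
    """Keep the new version only if the weighted change reaches the threshold."""
    return change_importance(old, new) >= threshold


# Example: the same page with only the ad changed, then with the article changed.
v1 = {"article": Block("Original story", 0.9), "ad": Block("Ad A", 0.1)}
v2_ad_only = {"article": Block("Original story", 0.9), "ad": Block("Ad B", 0.1)}
v2_article = {"article": Block("Updated story", 0.9), "ad": Block("Ad A", 0.1)}

print(should_archive(v1, v2_ad_only))  # ad-only change falls below the threshold
print(should_archive(v1, v2_article))  # article change exceeds the threshold
```

A score of this kind could also feed crawl scheduling, for instance by revisiting pages whose past versions showed high-importance changes more often, though the abstract does not say how the authors do this.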

Dates and versions

hal-01292035, version 1 (22-03-2016)

Cite

Myriam Ben Saad, Stéphane Gançarski. Using visual pages analysis for optimizing web archiving. In EDBT/ICDT 2010 Ph.D. Workshop, Mar 2010, Lausanne, Switzerland. pp.43, ⟨10.1145/1754239.1754287⟩. ⟨hal-01292035⟩