Archiving the Web using Changes Patterns: a Case Study

Myriam Ben Saad (1), Stéphane Gançarski (1)
(1) BD - Bases de Données, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract: A pattern is a model or a template used to summarize and describe the behavior (or the trend) of data that generally exhibit recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and used for different applications such as network monitoring, moving object tracking, financial or medical data analysis, scientific data processing, etc. In these contexts, the discovered patterns were useful to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or to improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools to efficiently archive Websites. We first define our pattern model, which describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of the French public TV channels (France Télévisions) is chosen as a case study to validate our approach. Our experimental evaluation, based on real Web pages, shows the utility of patterns to improve archive quality and to optimize indexing or storing.
Document type:
Journal article
International Journal on Digital Libraries, Springer Verlag, 2012, 13 (1), pp. 33-49. DOI: 10.1007/s00799-012-0094-z

https://hal.archives-ouvertes.fr/hal-01185456
Contributor: Lip6 Publications
Submitted on: Thursday, 20 August 2015 - 11:21:00
Last modified on: Thursday, 13 December 2018 - 01:49:18

Citation

Myriam Ben Saad, Stéphane Gançarski. Archiving the Web using Changes Patterns: a Case Study. International Journal on Digital Libraries, Springer Verlag, 2012, 13 (1), pp. 33-49. DOI: 10.1007/s00799-012-0094-z. HAL: hal-01185456
