Block-based Migration from HTML4 Standard to HTML5 Standard in the Context of Web Archives

Andrés Sanoja, Stéphane Gançarski
Abstract: Web archives are not exempt from format obsolescence. In the near future, Web pages written in HTML4 format could become obsolete. We will then have to choose between two preservation strategies: emulation or migration. Emulation is the most obvious option; however, given the size of the Web and the amount of information that Web archives handle, it is not practical. Migration to HTML5 format, on the other hand, seems plausible. It is a challenge because we need to modify a page (in HTML4 format) to include elements that do not even exist in that format (such as the HTML5 semantic elements). Using Web page segmentation, we show that, with the appropriate granularity, blocks resemble these semantic elements. We present the use of our segmentation tool, BoM (Block-o-Matic), to help migrate Web pages from HTML4 format to HTML5 format in the context of Web archives. We also present an evaluation framework for Web page segmentation that helps produce the metrics needed to compare the original and migrated versions. If both versions are similar, the migration has been successful. We present the experiments and the results obtained on a sample of 40 pages. We produced the manual segmentations for each page using our MoB tool. Results show that the migration process causes no data loss, but in the migrated version (after adding the semantic elements) the margins change: whitespace is added that shifts elements slightly on the page. While this is imperceptible to the human eye, it is difficult for systems to handle without prior knowledge of this situation.
SCTC16, May 2016, Caracas, Venezuela
Andrés Sanoja, Stéphane Gançarski. Block-based Migration from HTML4 Standard to HTML5 Standard in the Context of Web Archives. SCTC16, May 2016, Caracas, Venezuela.



