Conference paper. Year: 2016

Efficient construction of metadata-enhanced web corpora

Abstract

Metadata extraction is a known problem in general-purpose Web corpora, and so is extensive crawling with little yield. The contributions of this paper are threefold: a method to find and download large numbers of WordPress pages; a targeted extraction of content featuring much-needed metadata; and an analysis of the documents in the corpus with insights into actual blog uses. The study focuses on a publishing platform (WordPress), which allows for reliable extraction of structural elements such as metadata, posts, and comments. The download of about 9 million documents in the course of two experiments leads, after processing, to 2.7 billion tokens with usable metadata. This comparatively high yield is a step towards more efficient use of machine power and towards "Hi-Fi" web corpora. The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. However, existing typologies of Web texts have to be revised in the light of this hybrid genre.
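The abstract does not spell out the extraction pipeline, so the following is only a minimal sketch of what targeted extraction of structural elements from a WordPress-style page could look like, not the author's implementation. It assumes two common theme conventions: the post title carries the class `entry-title`, and the publication date sits in a `<time datetime="...">` element. The names `PostMetadataParser`, `extract_metadata`, and `SAMPLE` are hypothetical and introduced for illustration only.

```python
from html.parser import HTMLParser


class PostMetadataParser(HTMLParser):
    """Collect the title and publication date of a single blog post."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.date = None
        self._in_title = False
        self._title_tag = None
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Assumption: the theme marks the post title with class "entry-title".
        if "entry-title" in classes and self.title is None:
            self._in_title = True
            self._title_tag = tag
        # Assumption: the theme exposes the date as <time datetime="...">.
        elif tag == "time" and self.date is None and attrs.get("datetime"):
            self.date = attrs["datetime"]

    def handle_data(self, data):
        # Gather text only while inside the title element.
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        # The title ends when the element that opened it is closed.
        if self._in_title and tag == self._title_tag:
            self.title = "".join(self._title_parts).strip()
            self._in_title = False


def extract_metadata(html):
    """Return a dict with the post title and date (None when not found)."""
    parser = PostMetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "date": parser.date}


# Hypothetical sample page, written for this sketch.
SAMPLE = """
<article class="post hentry">
  <h1 class="entry-title"><a href="#">Hello world</a></h1>
  <time class="entry-date" datetime="2016-08-12T10:00:00+00:00">August 12, 2016</time>
  <div class="entry-content"><p>First post.</p></div>
</article>
"""

print(extract_metadata(SAMPLE))
```

Relying on markup that a single publishing platform emits consistently is what would make such extraction more dependable than generic heuristics. Real themes vary, so a production extractor would likely need fallbacks for pages that do not follow these conventions.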
Main file
Barbaresi_Efficient-Metadata_WAC_2016_archive.pdf (103.42 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01371704 , version 1 (26-09-2016)
hal-01371704 , version 2 (18-10-2016)

License

Attribution

Cite

Adrien Barbaresi. Efficient construction of metadata-enhanced web corpora. 10th Web as Corpus Workshop, Association for Computational Linguistics (ACL SIGWAC), Aug 2016, Berlin, Germany. pp.7-16, ⟨10.18653/v1/W16-2602⟩. ⟨hal-01371704v2⟩
