Efficient construction of metadata-enhanced web corpora

Adrien Barbaresi

doi:10.18653/v1/W16-2602

Conference paper. Year: 2016

Adrien Barbaresi

Role: Corresponding author
PersonId: 1134
IdHAL: adrien-barbaresi
ORCID: 0000-0002-8079-8694

Berlin-Brandenburgische Akademie der Wissenschaften

Austrian Academy of Sciences

Abstract

Metadata extraction is a known problem in general-purpose Web corpora, and so is extensive crawling with little yield. The contributions of this paper are threefold: a method to find and download large numbers of WordPress pages; a targeted extraction of content featuring much-needed metadata; and an analysis of the documents in the corpus, with insights into actual blog use. The study focuses on a single publishing platform (WordPress), which allows for reliable extraction of structural elements such as metadata, posts, and comments. The download of about 9 million documents in the course of two experiments yields, after processing, 2.7 billion tokens with usable metadata. This comparatively high yield is a step towards more efficiency with respect to machine power and towards "Hi-Fi" web corpora. The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. However, existing typologies of Web texts have to be revised in the light of this hybrid genre.

Keywords

Web For Corpus; Corpus Linguistics; Web Corpus Construction; Focused Crawling

Domains

Linguistics; Computation and Language [cs.CL]; Web

Main file

Barbaresi_Efficient-Metadata_WAC_2016_archive.pdf (103.42 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-01371704

Submitted on: Tuesday, 18 October 2016, 17:36:45

Last modified on: Wednesday, 12 December 2018, 13:32:04

Dates and versions

hal-01371704, version 1 (26-09-2016)

hal-01371704, version 2 (18-10-2016)

License

Attribution

Identifiers

HAL Id: hal-01371704, version 2
DOI: 10.18653/v1/W16-2602

Cite

Adrien Barbaresi. Efficient construction of metadata-enhanced web corpora. 10th Web as Corpus Workshop, Association for Computational Linguistics (ACL SIGWAC), Aug 2016, Berlin, Germany. pp. 7-16, ⟨10.18653/v1/W16-2602⟩. ⟨hal-01371704v2⟩
