Collection and Indexing of Tweets with a Geographical Focus

Adrien Barbaresi

Conference paper, Year: 2016

Role: Corresponding author
PersonId : 1134
IdHAL : adrien-barbaresi
ORCID : 0000-0002-8079-8694

Austrian Academy of Sciences

Abstract

This paper introduces a Twitter corpus that currently has a geographical focus, built in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets. Because access restrictions make it impossible to retrieve all available tweets, corpus construction involves a series of decisions, which are described in the paper. The corpus focuses on Austrian users: data collection relies on a two-tier detection process that addresses both corpus construction and user location issues. The emphasis lies on short messages whose sender either mentions a place in Austria as their hometown or tweets from a place located in Austria. The resulting user base is then queried and enlarged using focused crawling and random sampling, so that the corpus is refined and completed in the manner of a monitor corpus. Its current volume is 21.7 million tweets from approximately 125,000 users. The tweets are indexed using Elasticsearch and queried via the Kibana frontend, which allows for queries on metadata as well as for the visualization of geolocalized tweets (currently about 3.3% of the collection).
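
The two-tier detection mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the place-name list, the bounding box, and the function names are hypothetical, and the only assumption about the data is that tweets arrive as dictionaries with the `user.location` and `coordinates` fields of the classic Twitter JSON format (GeoJSON points in longitude, latitude order).

```python
import re

# Hypothetical mini-gazetteer; a realistic list of Austrian places would be far larger.
AUSTRIAN_PLACES = {"wien", "vienna", "graz", "linz", "salzburg",
                   "innsbruck", "österreich", "austria"}

# Rough bounding box around Austria (assumption): min_lon, min_lat, max_lon, max_lat.
# A box this simple also covers parts of neighbouring countries.
AUSTRIA_BBOX = (9.5, 46.3, 17.2, 49.1)


def profile_mentions_austria(tweet: dict) -> bool:
    """Tier 1: the free-text profile location names a known Austrian place."""
    location = (tweet.get("user") or {}).get("location") or ""
    tokens = re.findall(r"\w+", location.lower())
    return any(token in AUSTRIAN_PLACES for token in tokens)


def geotag_in_austria(tweet: dict) -> bool:
    """Tier 2: the tweet carries coordinates that fall inside the bounding box."""
    point = (tweet.get("coordinates") or {}).get("coordinates")
    if not point:
        return False
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = AUSTRIA_BBOX
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def keep_tweet(tweet: dict) -> bool:
    """A tweet is retained if either tier matches."""
    return profile_mentions_austria(tweet) or geotag_in_austria(tweet)
```

In such a setup, users whose tweets pass either test could seed the user base that is later enlarged by focused crawling; a production filter would need a proper gazetteer and country polygons rather than a bounding box.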

Keywords

Computer-Mediated Communication, Web Corpus Construction, Database Solutions, Visualization

Domains

Linguistics, Computation and Language [cs.CL], Web

Main file

Barbaresi_CMLC2016_Twitter.pdf (2.01 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-01323274

Submitted on: Tuesday, October 4, 2016, 16:21:51

Last modified on: Wednesday, December 12, 2018, 13:32:04

Long-term archiving on: Friday, February 3, 2017, 15:59:55

Dates and versions

hal-01323274 , version 1 (30-05-2016)

hal-01323274 , version 2 (04-10-2016)

hal-01323274 , version 3 (18-10-2016)

License

Attribution

Identifiers

HAL Id : hal-01323274 , version 2

Cite

Adrien Barbaresi. Collection and Indexing of Tweets with a Geographical Focus. Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016, Portorož, Slovenia. pp.24-27. ⟨hal-01323274v2⟩
