Collection and Indexing of Tweets with a Geographical Focus

Adrien Barbaresi 1, *
* Corresponding author
Abstract: This paper introduces a Twitter corpus that currently has a geographical focus, built in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets. Because access restrictions make it impossible to retrieve all available tweets, corpus construction involves a series of decisions, which are described below. The corpus focuses on Austrian users: data collection relies on a two-tier detection process that addresses both corpus construction and user location issues. The emphasis lies on short messages whose senders name a place in Austria as their hometown or tweet from places located in Austria. The resulting user base is then queried and enlarged using focused crawling and random sampling, so that the corpus is refined and completed in the manner of a monitor corpus. Its current volume is 21.7 million tweets from approximately 125,000 users. The tweets are indexed using Elasticsearch and queried via the Kibana frontend, which allows for queries on metadata as well as for the visualization of geolocalized tweets (currently about 3.3% of the collection).
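The two location criteria named in the abstract (a profile hometown in Austria, or a tweet sent from Austria) can be illustrated with a minimal Python sketch. It is not the author's implementation: the bounding box values, the tiny place list, the `location` field name, and all function names are assumptions made for illustration, and the tweet layout merely mimics the `user.location` and GeoJSON-style `coordinates` fields of Twitter's classic JSON format. The last function builds an Elasticsearch `geo_bounding_box` query body of the kind that could restrict a search to geolocalized tweets in that area.

```python
import re
from typing import Optional

# Rough bounding box around Austria (assumed values, not taken from the paper).
# A box like this also covers parts of neighbouring countries.
AUSTRIA_BBOX = {"min_lon": 9.5, "max_lon": 17.2, "min_lat": 46.3, "max_lat": 49.1}

# Tiny illustrative gazetteer (hypothetical); a real list would be far larger.
AUSTRIAN_PLACES = {"wien", "vienna", "graz", "linz", "salzburg",
                   "innsbruck", "österreich", "austria"}


def profile_mentions_austria(profile_location: Optional[str]) -> bool:
    """Check whether the free-text profile location names a known place."""
    if not profile_location:
        return False
    tokens = re.findall(r"\w+", profile_location.lower())
    return any(token in AUSTRIAN_PLACES for token in tokens)


def coordinates_in_austria(lon: float, lat: float) -> bool:
    """Check whether a point falls inside the rough bounding box."""
    return (AUSTRIA_BBOX["min_lon"] <= lon <= AUSTRIA_BBOX["max_lon"]
            and AUSTRIA_BBOX["min_lat"] <= lat <= AUSTRIA_BBOX["max_lat"])


def is_candidate(tweet: dict) -> bool:
    """Accept a tweet if either the profile or the coordinates point to Austria."""
    user = tweet.get("user") or {}
    if profile_mentions_austria(user.get("location")):
        return True
    geo = tweet.get("coordinates") or {}
    point = geo.get("coordinates")  # GeoJSON order: [longitude, latitude]
    if point and len(point) == 2:
        return coordinates_in_austria(point[0], point[1])
    return False


def geo_bbox_query(field: str = "location") -> dict:
    """Build an Elasticsearch query body filtering on a geo_point field.

    The field name is hypothetical; it must match the index mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": AUSTRIA_BBOX["max_lat"],
                                         "lon": AUSTRIA_BBOX["min_lon"]},
                            "bottom_right": {"lat": AUSTRIA_BBOX["min_lat"],
                                             "lon": AUSTRIA_BBOX["max_lon"]},
                        }
                    }
                }
            }
        }
    }
```

Token matching keeps "Australia" from being mistaken for "Austria", but free-text profile locations remain noisy, so a filter of this kind would only be a first pass.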
Document type:
Conference paper
Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016, Portorož, Slovenia. pp.24-27, 2016, Proceedings of the 4th Workshop on Challenges in the Management of Large Corpora (CMLC)

https://hal.archives-ouvertes.fr/hal-01323274
Contributor: Adrien Barbaresi
Submitted on: Tuesday, 18 October 2016 - 17:43:37
Last modified on: Wednesday, 12 December 2018 - 13:32:04

File

Barbaresi_CMLC2016_Twitter_arc...
Files produced by the author(s)

License

Distributed under a Creative Commons Attribution 4.0 International License

Identifiers

  • HAL Id : hal-01323274, version 3

Citation

Adrien Barbaresi. Collection and Indexing of Tweets with a Geographical Focus. Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016, Portorož, Slovenia. pp.24-27, 2016, Proceedings of the 4th Workshop on Challenges in the Management of Large Corpora (CMLC). 〈hal-01323274v3〉
