Conference paper, Year: 2016

Collection and Indexing of Tweets with a Geographical Focus

Abstract

This paper introduces a Twitter corpus with a geographical focus, built in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets. Because access restrictions make it impossible to retrieve all available tweets, corpus construction involves a series of decisions, which are described in the paper. The corpus focuses on Austrian users: data collection relies on a two-tier detection process that addresses corpus construction and user location issues. The emphasis lies on short messages whose sender names a place in Austria as their hometown, or which are sent from places located in Austria. The resulting user base is then queried and enlarged using focused crawling and random sampling, so that the corpus is refined and completed in the manner of a monitor corpus. Its current volume is 21.7 million tweets from approximately 125,000 users. The tweets are indexed using Elasticsearch and queried via the Kibana frontend, which allows for queries on metadata as well as for the visualization of geolocalized tweets (currently about 3.3% of the collection).
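The two selection criteria named in the abstract (a profile location that mentions a place in Austria, or a tweet sent from within Austria) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the tiny gazetteer, the rough bounding box, the function names, and the output field names are all hypothetical, and the input is assumed to follow the Twitter API v1.1 tweet layout (`user.location`, `place.country_code`, and `coordinates` as a GeoJSON point in longitude, latitude order).

```python
from typing import Optional, Tuple

# Hypothetical, deliberately tiny gazetteer; a real one would be far larger.
AUSTRIAN_PLACES = {
    "wien", "vienna", "graz", "linz", "salzburg", "innsbruck",
    "österreich", "austria",
}

# Rough bounding box around Austria (an assumption for illustration;
# it also covers parts of neighbouring countries).
LON_MIN, LON_MAX = 9.5, 17.2
LAT_MIN, LAT_MAX = 46.3, 49.1


def profile_mentions_austria(tweet: dict) -> bool:
    """Criterion 1: the free-text profile location names a known place."""
    location = (tweet.get("user") or {}).get("location") or ""
    tokens = location.lower().replace(",", " ").replace("/", " ").split()
    return any(token in AUSTRIAN_PLACES for token in tokens)


def tweet_coordinates(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) if the tweet carries a GeoJSON point, else None."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    if not coords or len(coords) != 2:
        return None
    lon, lat = coords  # GeoJSON order is longitude first
    return lat, lon


def sent_from_austria(tweet: dict) -> bool:
    """Criterion 2: tagged place in Austria, or point inside the rough box."""
    if (tweet.get("place") or {}).get("country_code") == "AT":
        return True
    point = tweet_coordinates(tweet)
    if point is None:
        return False
    lat, lon = point
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def keep_tweet(tweet: dict) -> bool:
    """Keep a tweet if either criterion holds."""
    return profile_mentions_austria(tweet) or sent_from_austria(tweet)


def to_index_document(tweet: dict) -> dict:
    """Flatten a tweet into a document that could be sent to a search index."""
    user = tweet.get("user") or {}
    doc = {
        "id": tweet.get("id_str"),
        "text": tweet.get("text"),
        "user": user.get("screen_name"),
        "user_location": user.get("location"),
    }
    point = tweet_coordinates(tweet)
    if point is not None:
        # A {"lat": ..., "lon": ...} object is one of the shapes that
        # Elasticsearch accepts for a field mapped as geo_point.
        doc["location"] = {"lat": point[0], "lon": point[1]}
    return doc
```

Only documents that carry a `location` value could appear on a map, which is consistent with the abstract's remark that only a small share of the collection is geolocalized. For the field to be usable in geographic queries, the index mapping would need to declare it as `geo_point`.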
Main file
Barbaresi_CMLC2016_Twitter.pdf (2.01 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01323274, version 1 (30-05-2016)
hal-01323274, version 2 (04-10-2016)
hal-01323274, version 3 (18-10-2016)

License

Attribution

Identifiers

  • HAL Id: hal-01323274, version 2

Cite

Adrien Barbaresi. Collection and Indexing of Tweets with a Geographical Focus. Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016, Portorož, Slovenia. pp.24-27. ⟨hal-01323274v2⟩