Conference papers

Collection and Indexing of Tweets with a Geographical Focus

Adrien Barbaresi 1, *
* Corresponding author
Abstract : This paper introduces a Twitter corpus that currently has a geographical focus, built in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets. Because access restrictions make it impossible to retrieve all available tweets, corpus construction involves a series of decisions, which are described in the paper. The corpus focuses on Austrian users: data collection is based on a two-tier detection process that addresses both corpus construction and user location. The emphasis lies on short messages whose sender names a place in Austria as their hometown, or which are sent from places located in Austria. The resulting user base is then queried and enlarged using focused crawling and random sampling, so that the corpus is refined and completed in the manner of a monitor corpus. Its current volume is 21.7 million tweets from approximately 125,000 users. The tweets are indexed using Elasticsearch and queried via the Kibana frontend, which allows for queries on metadata as well as for the visualization of geolocated tweets (currently about 3.3% of the collection).
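The two criteria named in the abstract (a profile location in Austria, or a tweet sent from within Austria) can be illustrated with a minimal Python sketch. It is not the author's implementation: the field names `user_location` and `coordinates`, the tiny place list, and the approximate bounding box are all assumptions made for illustration. The last function builds an Elasticsearch `geo_bounding_box` filter of the kind that could select geolocated tweets in an index, assuming the coordinates are stored in a `geo_point` field.

```python
# Approximate bounding box for Austria (assumed values, illustration only).
# A rectangle also covers parts of neighbouring countries, so this is a
# coarse first filter, not an exact border test.
AUSTRIA_BBOX = {"south": 46.37, "north": 49.02, "west": 9.53, "east": 17.16}

# Tiny hypothetical gazetteer; a realistic one would be far larger.
AUSTRIAN_PLACES = {
    "wien", "vienna", "graz", "linz", "salzburg",
    "innsbruck", "österreich", "austria",
}


def profile_mentions_austria(location):
    """Check whether the free-text profile location names a known place."""
    if not location:
        return False
    tokens = location.lower().replace(",", " ").split()
    return any(token in AUSTRIAN_PLACES for token in tokens)


def coordinates_in_austria(coords):
    """Check whether a (lat, lon) pair falls inside the bounding box."""
    if coords is None:
        return False
    lat, lon = coords
    return (AUSTRIA_BBOX["south"] <= lat <= AUSTRIA_BBOX["north"]
            and AUSTRIA_BBOX["west"] <= lon <= AUSTRIA_BBOX["east"])


def is_candidate(tweet):
    """Keep a tweet if either criterion matches."""
    return (profile_mentions_austria(tweet.get("user_location"))
            or coordinates_in_austria(tweet.get("coordinates")))


def geo_query(field="coordinates"):
    """Build an Elasticsearch query body filtering on the bounding box.

    The field name is hypothetical and must be mapped as geo_point.
    """
    return {
        "query": {"bool": {"filter": {"geo_bounding_box": {field: {
            "top_left": {"lat": AUSTRIA_BBOX["north"],
                         "lon": AUSTRIA_BBOX["west"]},
            "bottom_right": {"lat": AUSTRIA_BBOX["south"],
                             "lon": AUSTRIA_BBOX["east"]},
        }}}}}
    }
```

For example, `is_candidate({"user_location": "Wien, Österreich"})` matches on the profile, while a tweet with no profile location but coordinates near Vienna matches on the bounding box.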

Contributor : Adrien Barbaresi
Submitted on : Tuesday, October 18, 2016 - 5:43:37 PM
Last modification on : Wednesday, December 12, 2018 - 1:32:04 PM




Distributed under a Creative Commons Attribution 4.0 International License


  • HAL Id : hal-01323274, version 3



Adrien Barbaresi. Collection and Indexing of Tweets with a Geographical Focus. Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016, Portorož, Slovenia. pp.24-27. ⟨hal-01323274v3⟩


