Contrasting Twitter Collection Methodologies during Extreme Events


Project Goal

Different strategies were applied to collect tweets during Hurricanes Harvey and Irma, two environmental disasters that affected the United States in 2017. These collection strategies not only recover data at different spatial resolutions, but also capture significantly different types of users, numbers of tweets per user, and so on. As a result, completely different networks can be constructed from each approach, and special care needs to be taken when drawing conclusions from them. The purpose of this project is to characterize these differences and determine how to leverage them to obtain tweets relevant to extreme events at a higher spatial resolution.

Results were presented as a poster at the HA/DR conference and at the CASOS Summer Institute.

Methods Used

Tweets were collected during each event following two different methodologies: text-based search with key terms related to each event, and bounding boxes around the affected areas. To filter the tweets that discussed events related to each hurricane, we classified them by their use of terms from the following "disaster categories" in the four major languages spoken in the area (English, Spanish, French and Portuguese): Corruption, Basic Needs, Top Hashtags (related to the event) and mention of a disaster event.
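The filtering step described above can be sketched roughly as follows. This is only an illustration: the term lists, the category keys, the function names and the example bounding box are all assumptions, since the project's actual lexicons and collection parameters are not given here.

```python
import re

# Hypothetical term lists for illustration only; the project's real
# multilingual lexicons are not reproduced in this page.
CATEGORY_TERMS = {
    "corruption": {"corruption", "corrupción", "corrupção"},
    "basic_needs": {"water", "agua", "eau", "água", "food", "comida"},
    "disaster_event": {"hurricane", "huracán", "ouragan", "furacão", "flood"},
    "top_hashtags": {"#harvey", "#irma"},
}

# Matches plain words and hashtags, including accented characters.
TOKEN_RE = re.compile(r"#?\w+")


def categorize(text):
    """Return the set of disaster categories whose terms appear in the text."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return {cat for cat, terms in CATEGORY_TERMS.items() if tokens & terms}


def in_bounding_box(lon, lat, box):
    """Check a point against box = (west, south, east, north) in degrees.

    Assumes the box does not cross the antimeridian.
    """
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


# Hypothetical box roughly around Houston, used only as an example.
EXAMPLE_BOX = (-96.0, 29.0, -94.5, 30.5)
```

For instance, `categorize("Necesitamos agua y comida #Irma")` would flag the Basic Needs and Top Hashtags categories under these assumed term lists, and a tweet that matches no category would be filtered out.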

Expected Results

Given the high volume of tweets produced during these disaster-type events, the two collection methodologies have low overlap. The degree of overlap can provide a measure of the spatial specificity of the search terms.
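One way to quantify this kind of measure is to compare the sets of tweet IDs returned by each method. The sketch below is an assumption about how that could be done (the function name and the choice of the Jaccard index are illustrative, not the project's stated metric).

```python
def overlap_stats(text_ids, bbox_ids):
    """Compare tweet IDs from text-based and bounding-box collection."""
    text_ids, bbox_ids = set(text_ids), set(bbox_ids)
    shared = text_ids & bbox_ids
    union = text_ids | bbox_ids
    return {
        "shared": len(shared),
        # Jaccard index: shared tweets relative to all distinct tweets.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Fraction of text-based tweets that also fall inside the box;
        # a rough proxy for how spatially specific the search terms are.
        "share_of_text": len(shared) / len(text_ids) if text_ids else 0.0,
        "share_of_bbox": len(shared) / len(bbox_ids) if bbox_ids else 0.0,
    }
```

Under this reading, a higher `share_of_text` would suggest that the search terms pull in more tweets from the affected area.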

Differences in the Use of Disaster-Related Terms:

[Figures: Contrasting Twitter 1, Contrasting Twitter 2]
Spikes in the use of terms in each of the disaster categories tend to occur first in the tweets collected via bounding box. Also, as expected, a considerably higher proportion of the tweets obtained via bounding box are geolocated (regardless of the disaster category).
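The two observations above could be checked with something like the following sketch. The spike rule (a count above the mean plus k standard deviations), the time binning, and the tweet schema with a `coordinates` key are all assumptions made for illustration, not the project's documented procedure.

```python
from statistics import mean, pstdev


def first_spike(counts, k=2.0):
    """Index of the first time bin whose count exceeds mean + k * std, else None."""
    if not counts:
        return None
    threshold = mean(counts) + k * pstdev(counts)
    for i, count in enumerate(counts):
        if count > threshold:
            return i
    return None


def spike_lead(bbox_counts, text_counts, k=2.0):
    """Bins by which the bounding-box spike precedes the text-based spike.

    Positive values mean the bounding-box spike came first.
    """
    b = first_spike(bbox_counts, k)
    t = first_spike(text_counts, k)
    if b is None or t is None:
        return None
    return t - b


def geolocated_share(tweets):
    """Fraction of tweets carrying coordinates (assumed dict schema)."""
    if not tweets:
        return 0.0
    return sum(1 for tw in tweets if tw.get("coordinates")) / len(tweets)
```

Both count series are assumed to use the same time bins (for example, hourly counts of tweets in one disaster category).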


The different collection strategies not only sample different types of users, but also produce significantly different network structures. These differences are observed in both events and tend to be exacerbated during usage peaks. Text-based collection appears to be more reactive, as spikes in the use of disaster-related terms occur first in tweets collected via bounding box. These differences can be leveraged by combining both collection methods to obtain a more comprehensive sample. A promising strategy for increasing the geographic precision of text-based searches is to periodically update the search parameters based on locally relevant topics (identified from hashtags or terms).
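The periodic update strategy could look roughly like the sketch below, which promotes hashtags that are frequent in recent bounding-box tweets into the text-based search terms. The function name, the `top_n` and `min_count` parameters, and the example hashtags are hypothetical; the project does not specify how the update would be implemented.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def update_search_terms(current_terms, bbox_texts, top_n=3, min_count=2):
    """Append locally frequent hashtags from bounding-box tweets to the search terms."""
    counts = Counter()
    for text in bbox_texts:
        # Count each hashtag at most once per tweet, case-insensitively.
        counts.update({tag.lower() for tag in HASHTAG_RE.findall(text)})
    known = {term.lower() for term in current_terms}
    # Most frequent first; ties broken alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [tag for tag, n in ranked if n >= min_count and tag not in known]
    return list(current_terms) + new_terms[:top_n]


# Illustrative input only; these tweets are made up.
recent_bbox_texts = [
    "Flooding on I-10 #Harvey #HoustonFlood",
    "#houstonflood shelters open",
    "need boats #CajunNavy",
    "#cajunnavy #harvey on the way",
    "#random",
]
updated = update_search_terms(["#harvey"], recent_bbox_texts)
```

Running this at a fixed interval over the most recent window of bounding-box tweets would keep the text-based query aligned with what people in the affected area are actually discussing.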