Twitter Analytics for Global Health - Data Collection and Processing
20 Apr 2018
Ryan Nazareth

Social media platforms like Twitter have become a popular source of insights because of the vast amount of data they make available. In addition, data can be acquired in real time, which makes Twitter a preferred option for investigating real-time trends in “hot topics” compared with alternative sources such as news articles and surveys. In the global health domain, Twitter is a popular medium used by organisations, researchers and the general public to voice opinions on different global health topics. One of the key milestones in global health during 2017 was the addition of snakebite to the list of Neglected Tropical Diseases (NTDs) by the World Health Organisation (WHO) in June 2017. Insights generated by mining Twitter data on discussions around problems, interventions, incidents and drug treatment breakthroughs specific to snakebite can help build a picture of disease burden and socioeconomic costs in real time. This is the first of two blog posts discussing a data science R&D project carried out at Manta Ray Media Ltd, investigating what insights could be generated by mining snakebite-related tweets.

Data Collection 

Twitter’s Standard Search API was used to collect the data for this study. The API allows tweets to be collected based on search queries and places strict limits on the number of tweets sampled (searches are restricted to tweets from the past week and requests are rate limited per 15-minute window). The Python code snippet below shows how the TwitterSearch package is used to make calls to the API after acquiring a new user access token, following registration of a new Twitter application. Each search query is a keyword or sequence of words thought to be most relevant to the topic of snakebite, such as ‘#snakebite’, ‘snake anti-venom’, ‘snakebite ntd’, ‘snakebite’ and ‘snake venom’. After a few trial runs of data collection, the keywords ‘snake venom’ and ‘snakebite’ were excluded from the list as they generated more unrelated tweets than the other search queries.
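As a rough illustration of the collection step described above, the sketch below uses the TwitterSearch package to run one search query and flatten each returned tweet into the columns used in this study. It is not the exact code used in the project: the helper names tweet_to_row and collect_tweets are made up for this example, the credentials dict is a placeholder, and the field names assume the tweet JSON returned by the Standard Search API.

```python
# Search queries kept after the trial runs described in the post.
SEARCH_QUERIES = ["#snakebite", "snake anti-venom", "snakebite ntd"]


def tweet_to_row(tweet):
    """Flatten one tweet (a dict parsed from the API's JSON) into a row.

    Field names are assumed from the Standard Search API tweet object.
    """
    user = tweet.get("user", {})
    return {
        "user": user.get("screen_name"),
        "favourites": tweet.get("favorite_count", 0),
        "retweets": tweet.get("retweet_count", 0),
        "location": user.get("location"),
        "timezone": user.get("time_zone"),
        "retweeted": tweet.get("retweeted", False),
        "text": tweet.get("text", ""),
    }


def collect_tweets(query, credentials):
    """Run one search query and return a list of rows.

    `credentials` is a hypothetical dict holding the four values issued
    when a Twitter application is registered.
    """
    # Imported here so tweet_to_row can be used without the package installed.
    from TwitterSearch import TwitterSearch, TwitterSearchOrder

    order = TwitterSearchOrder()
    order.set_keywords(query.split())  # every word in the query must match
    order.set_include_entities(False)

    client = TwitterSearch(
        consumer_key=credentials["consumer_key"],
        consumer_secret=credentials["consumer_secret"],
        access_token=credentials["access_token"],
        access_token_secret=credentials["access_token_secret"],
    )
    return [tweet_to_row(t) for t in client.search_tweets_iterable(order)]
```

Calling collect_tweets once per entry in SEARCH_QUERIES and concatenating the results would give a table with the columns listed in the next paragraph. Because the API only covers the past week, a collection like this would need to be repeated regularly over the study period.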

A total of 1750 tweets were acquired over a 6-month period between 3rd July 2017 and 3rd January 2018. The columns of the dataset were the user’s Twitter handle, number of favourites, number of retweets, location, timezone, whether the tweet was retweeted or not, and the tweet text. A quick inspection of the tweets revealed a set of recurring topics. These broadly fall into categories such as new research developments/publications, incidents, safety advice, facts about snake species and government/political issues such as funding for research or antivenom supply.

Original tweet
#Snakebites cost #SriLanka more than $10 million
250 people died from snake bites. No anti-venom, apparently budget issues. 250 people died in 2 weeks. That's a scandal.
Animal heroes: Pit Bull saves children from venomous copperhead  snake #infinitefireinc #pitbullsaveskidsfromsnake? https://t.co/ZdL8dVDvws
As summer is coming i wonder how much anti-venom we have in our hospitals & clinics for quick treatment of snake bites..
Australian mom discovered a large venomous snake terrorizing  her kid's Lego city
Danish research could revolutionise snakebite treatment https://t.co/uFhLBYDeTL #snakebite
Senate wants ministry to provide anti-venom for snake bite victims https://t.co/EL7hqE0CeQ
The King Cobra is the longest snake in the world with the ability to inject venom. They can grow up to 5.6 m (18.5 ft) in length. https://t.co/TXgdbAcy0P
The Quest to Modernize Snakebite Medicine - Wall Street Journal (subscription) https://t.co/muXs6UqTKf

 

Data Cleaning and Preparation

Tweet Filtering 

Although the search queries were based on keywords selected to be most appropriate to snakebite, further filtering is required to account for words which could be used in a different context (and unrelated to snakebite topics). For example, the word ‘snakebite’ could mean multiple things: the nickname of a darts player, a drink, a facial piercing etc. Throughout the 6-month data acquisition process, a 'filter' word list was built up to filter out tweets concerning unrelated topics. Examples of words in this list include ‘tattoo’, ‘bar’, ‘ring’ and ‘darts’. Tweets were also filtered out if they were posted by certain users who tweeted frequently on topics not related to snakebite disease. Finally, any tweets starting with ‘RT’ (retweeted by another user) were removed, to keep only the original tweets. The following Python code shows two functions defined to filter out users and tweets based on a filter list.
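The filtering rules above can be sketched roughly as follows. This is a minimal version that assumes each tweet is a dict with "user" and "text" keys; the function names and the example handle are hypothetical, and the project's own functions may differ (for instance by working on a data frame).

```python
import re

FILTER_WORDS = ["tattoo", "bar", "ring", "darts"]  # examples from the post
FILTER_USERS = ["example_darts_fan"]  # hypothetical handle, for illustration


def filter_users(tweets, blocked_users):
    """Drop tweets posted by any handle in blocked_users (case-insensitive)."""
    blocked = {u.lower() for u in blocked_users}
    return [t for t in tweets if t["user"].lower() not in blocked]


def filter_tweets(tweets, filter_words):
    """Drop retweets and tweets containing any filter word as a whole word."""
    pattern = None
    if filter_words:
        alternatives = "|".join(re.escape(w) for w in filter_words)
        pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

    kept = []
    for tweet in tweets:
        text = tweet["text"]
        if text.startswith("RT"):
            continue  # retweeted by another user
        if pattern is not None and pattern.search(text):
            continue  # mentions an unrelated topic
        kept.append(tweet)
    return kept
```

Matching on whole words is a design choice made for this sketch: a plain substring test on ‘bar’ or ‘ring’ would also discard tweets containing words such as ‘barely’ or ‘piercing’.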

 

Stop Word Removal, Stemming and Lemmatisation

Non-alphabetic characters, punctuation and URLs were removed from the tweets. In addition, stop words (commonly used words such as "the", "a", "an", "in") were removed as they do not provide any valuable information. The words were then converted to lower case to avoid redundancy, followed by stemming and lemmatisation. Stemming is the process of reducing a word to its root form by removing its suffix. Lemmatisation is the process of converting a word to its “lemma” or dictionary form (e.g. the words going, gone and went are all converted to go). The following Python code snippet shows how this process is implemented using the nltk package.
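A minimal sketch of these cleaning steps is given below, under a few assumptions. The stop word set is a tiny illustrative one (nltk's English stop word list would normally replace it), text is lower-cased first so that stop words match regardless of case, and lemmatisation is applied before stemming with every token treated as a verb. The function names are made up for this example.

```python
import re

URL_RE = re.compile(r"https?://\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

# Tiny illustrative stop list; nltk.corpus.stopwords.words("english")
# is the fuller list one would normally use.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "for"}


def tokenise(text, stop_words=STOP_WORDS):
    """Lower-case, strip URLs and non-alphabetic characters, drop stop words."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_ALPHA_RE.sub(" ", text)
    return [word for word in text.split() if word not in stop_words]


def normalise(tokens):
    """Lemmatise, then stem, each token using nltk.

    Requires nltk and its WordNet data (nltk.download("wordnet")).
    Treating every token as a verb (pos="v") is a simplification.
    """
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    lemmatiser = WordNetLemmatizer()
    stemmer = PorterStemmer()
    return [stemmer.stem(lemmatiser.lemmatize(t, pos="v")) for t in tokens]
```

A cleaned tweet would then be produced with normalise(tokenise(tweet_text)). Stemming can produce roots that are not dictionary words, so in practice one might keep only the lemmatised form if readable tokens are needed for visualisation.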

 

Geocoding 

Users have the option of manually entering their current geographical location or enabling GPS. Some users choose not to post their location details, or post fake details, which is why 300 out of 1750 tweets did not have any location details associated with them. To plot the locations on a map, the location details needed to be converted to latitude and longitude coordinates. The code snippet below shows how the geocode function in the ggmap library in the R programming language was used to make calls to the Google Maps API to obtain the latitude and longitude coordinates. If only the country name is specified (rather than the city or region), the latitude and longitude coordinates are set to the centre of the country.
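The project itself used the geocode function from R's ggmap library. To keep the examples in this post in one language, the sketch below shows the same idea in Python by calling the Google Geocoding web service directly, which is a substitution and not the author's code. The function names are hypothetical, an API key is assumed to be available, and tweets without a location are skipped.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(location, api_key):
    """Build the request URL for a free-text location string."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": location, "key": api_key})


def parse_coordinates(response):
    """Return (lat, lng) from a parsed JSON response, or None if no match."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    point = response["results"][0]["geometry"]["location"]
    return point["lat"], point["lng"]


def geocode(location, api_key):
    """Geocode one location string; returns None for missing locations."""
    if not location or not location.strip():
        return None  # the user supplied no location details
    with urlopen(build_geocode_url(location, api_key)) as reply:
        return parse_coordinates(json.load(reply))
```

Caching results per distinct location string would reduce the number of API calls, since many users share the same location text.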

In the second part of this blog, we will see what insights we can generate through visualisations.


Ryan is a Data Scientist at Manta Ray Media. He has a keen interest in natural language processing, big data and visualisation applications in global health. For further information or questions related to this blog, please visit the GitHub repository here | Twitter: @rkn0386