Twitter Analytics for Global Health - Exploratory Analysis and Visualisation
25 Apr 2018
Ryan Nazareth

In the first part of this blog, we saw how to build a Natural Language Processing pipeline in Python to extract meaningful snakebite-related tweets. We will now see what insights we can generate by exploring the data through visualisations.

The figures below show word clouds and bar charts of the most frequent words, generated using the ggplot2 library in R. The entire corpus of tweets was first cleaned and tokenised into unigrams (single words such as "project", "health") and bigrams (two-word sequences such as "state government", "ntd approval"), excluding the keywords used to search for the tweets, i.e. ‘snake’, ‘snakebite’, ‘venom’. The top 5 unigrams (frequency above 150) were ‘gombe’, ‘n8m’, ‘approves’, ‘venomous’, ‘antivenom’, whilst the top 5 bigrams (frequency above 100) were ‘approves n8m’, ‘gombe approves’, ‘gombe state’, ‘state government’, ‘government approves’. These relate to tweets about Gombe state in Nigeria approving N8 million in project funding for research into more effective local snake anti-venom, a story that was mentioned heavily by a number of users.

Inspecting the unigram and bigram word clouds further, we can also see other words related to snakebite disease such as ‘health’, ‘envenoming’, ‘deadly’, ‘treat’, ‘venomous bites’, ‘ntd’, ‘nigeria’, and more specific incident-related words such as ‘protect owner’, ‘courageous dog’. There is also evidence of research themes in the corpus through words such as ‘professor working’, ‘carbon monoxide’, ‘funding’, ‘project’, reflecting the large number of tweets about publications and cutting-edge research into snakebite treatment around the world.
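The charts in this post were produced in R, but the counting step behind them can be sketched in a few lines of Python, the language used for the pipeline in part one. This is only an illustration under stated assumptions: the `tokenise` and `ngram_counts` helpers and the two sample tweets are hypothetical, the tokeniser is a simple regular expression, and no stop-word removal or other cleaning is applied.

```python
import re
from collections import Counter

# Search keywords excluded from the counts (as listed in the post).
SEARCH_KEYWORDS = {"snake", "snakebite", "venom"}


def tokenise(text):
    """Lowercase a tweet, split it into alphanumeric tokens and drop search keywords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in SEARCH_KEYWORDS]


def ngram_counts(tweets, n):
    """Count n-grams (as space-joined strings) across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenise(tweet)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "Gombe state government approves N8m for snake venom research",
    "Gombe approves N8m antivenom project",
]

unigrams = ngram_counts(sample_tweets, 1)
bigrams = ngram_counts(sample_tweets, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

One design choice to be aware of: because the search keywords are dropped before the bigrams are formed, words on either side of a removed keyword end up adjacent. Filtering bigrams after they are built would give slightly different counts.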



The figure below shows the top 10 users who accumulated the most favourites and retweets (with a lower limit of 30 favourites or 30 retweets) during the 6-month period. These users came from:

  • News organisations or online publishers such as Mashable, New York Times, WebMD (@mashable, @nytimes, @WebMD)
  • Global health organisations such as the Liverpool School of Tropical Medicine, Médecins Sans Frontières, Health Action International (@LSTM, @MSFsci, @HAImedicines)
  • Global health researchers or policy advisors like Nick Casewell, Peter Hotez, Julien Potet (@nickcasewell, @PeterHotez, @julienpotet)
  • Bot or anonymous accounts (@Belxab, @ALT_uscis)

The New York Times was very active during this period and, given its global popularity and online following, it's not surprising that it received almost 750 retweets and favourites. Interestingly, the Spanish-language bot account @Belxab generated the second-highest number of favourites and retweets (276) from a single tweet, “Efecto del veneno de serpiente sobre la sangre” (which translates as “The effect of snake venom on blood”).
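The ranking described above can be sketched as follows. The post does not say whether the lower limit of 30 applies per tweet or per user, so this sketch assumes it applies to each user's accumulated totals. The `top_users` function, the record fields and the sample accounts are all hypothetical.

```python
from collections import defaultdict


def top_users(tweets, limit=10, min_count=30):
    """Rank users by total favourites plus retweets.

    Assumption: a user is kept only if they accumulated at least
    `min_count` favourites or at least `min_count` retweets overall.
    """
    totals = defaultdict(lambda: {"favourites": 0, "retweets": 0})
    for tweet in tweets:
        user_totals = totals[tweet["user"]]
        user_totals["favourites"] += tweet["favourites"]
        user_totals["retweets"] += tweet["retweets"]

    eligible = [
        (user, t["favourites"] + t["retweets"])
        for user, t in totals.items()
        if t["favourites"] >= min_count or t["retweets"] >= min_count
    ]
    # Highest combined engagement first.
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return eligible[:limit]


# Hypothetical records, for illustration only.
sample = [
    {"user": "account_a", "favourites": 40, "retweets": 10},
    {"user": "account_b", "favourites": 5, "retweets": 5},
    {"user": "account_a", "favourites": 10, "retweets": 0},
    {"user": "account_c", "favourites": 0, "retweets": 100},
]

ranking = top_users(sample)
print(ranking)
```

Under this assumption a single widely shared tweet is enough to place an account high in the ranking, which matches the @Belxab case described above.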


The world map, built using the JavaScript charting library D3, shows the aggregated tweets for all 6 months plotted at their geo-encoded coordinates, with the number of favourites and retweets encoded by the colour and size of the circles respectively. A tweet with a large number of favourites and retweets is represented by a larger pink circle, whilst a tweet with few retweets and favourites is shown as a smaller red circle. Initial exploration showed that most tweets came from users in developed countries and regions like the US, UK, Australia and Western Europe, as well as countries with advanced IT infrastructure like India. Underdeveloped countries and regions like Nigeria and parts of East and Southern Africa also had a high concentration of tweets. A smaller proportion of tweets also originated from countries in East and South East Asia, namely China, Japan, the Philippines, Malaysia and Thailand.
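The encoding of counts into circle size and colour can be illustrated with a small sketch. The map itself was built with D3 and its exact scales are not given in this post, so the functions below are hypothetical stand-ins written in Python: a square-root scale for the radius (a common choice so that circle size does not grow too quickly with the count) and a linear red-to-pink interpolation for the colour. The RGB values and radius limits are assumptions.

```python
import math


def sqrt_scale(value, domain_max, r_min=2.0, r_max=20.0):
    """Map a count in [0, domain_max] to a radius in [r_min, r_max].

    The radius grows with the square root of the count; values
    outside the domain are clamped.
    """
    if domain_max <= 0:
        return r_min
    clamped = min(max(value, 0), domain_max)
    return r_min + (r_max - r_min) * math.sqrt(clamped / domain_max)


def colour_scale(value, domain_max, low=(255, 0, 0), high=(255, 105, 180)):
    """Linearly interpolate an RGB colour from `low` (red) to `high` (pink)."""
    if domain_max <= 0:
        return low
    frac = min(max(value, 0), domain_max) / domain_max
    return tuple(round(a + (b - a) * frac) for a, b in zip(low, high))


# Hypothetical tweet: 100 favourites and 25 retweets, with maxima of 100 each.
radius = sqrt_scale(25, 100, r_min=0.0, r_max=20.0)
colour = colour_scale(100, 100)
print(radius, colour)
```

A non-zero minimum radius keeps tweets with no retweets visible on the map, at the cost of the circle area no longer being strictly proportional to the count.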

In the UK, tweets originated mainly from the North West (Liverpool), Oxford, Cambridge and London. This is associated with the high concentration of tropical medicine organisations and researchers in these cities, such as the Liverpool School of Tropical Medicine and Nick Casewell in Liverpool and the Royal Society of Tropical Medicine and Hygiene in London. In Nigeria, tweets from Lagos in the west and Abuja in the central region were mainly from local news accounts. None of the users in the UK and Nigeria were in the list of the top 10 favourited users during the entire 6-month period.

The first version of an interactive visualisation incorporating a number of different charts can be found here. The time slider at the bottom of the map allows the user to filter the data by month, from July to December (January 2018 was excluded as the dataset only contains tweets for its first three days). By using the play button, the user can quickly visualise the geographic spread of tweet activity for each month. This also updates the bar chart on the bottom right, which displays the top 10 most favourited tweets. The dual-axis plot shows the tweet frequency across the 6-month period and the proportion of tweets with positive sentiment. This is still a work in progress, and I intend to change the design of the visualisation and add further features (updates will be posted in my GitHub repository, accessible here).
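The two series in the dual-axis plot (tweets per month and the share of positive tweets) can be derived with a simple aggregation. The sketch below assumes each tweet record already carries a date and a sentiment label. The `monthly_summary` function, the field names, the label value "positive" and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_summary(tweets):
    """Return tweet count and share of positive tweets per (year, month)."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for tweet in tweets:
        key = (tweet["date"].year, tweet["date"].month)
        counts[key] += 1
        if tweet["sentiment"] == "positive":
            positives[key] += 1
    return {
        key: {
            "tweets": counts[key],
            "positive_share": positives[key] / counts[key],
        }
        for key in sorted(counts)
    }


# Hypothetical records, for illustration only.
sample = [
    {"date": date(2017, 7, 3), "sentiment": "positive"},
    {"date": date(2017, 7, 20), "sentiment": "negative"},
    {"date": date(2017, 8, 1), "sentiment": "positive"},
]

summary = monthly_summary(sample)
print(summary)
```

The same grouping key could drive the time slider, since filtering the map to one month is just selecting the records that share a (year, month) pair.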

Ryan is a Data Scientist at Manta Ray Media. He has a keen interest in natural language processing, big data and visualisation applications in global health. For more information or questions regarding this blog, please visit the GitHub repository here | Twitter: @rkn0386