In the first part of this blog, we saw how to build a Natural Language Processing pipeline in Python to extract meaningful snakebite-related tweets. We will now see what insights we can generate by exploring the data through visualisations.
The figures below show word clouds and bar charts of the most frequent words, generated using the ggplot2 library in R. The entire corpus of tweets was first cleaned and tokenised into unigrams (single words such as "project" or "health") and bigrams (two-word sequences such as "state government" or "ntd approval"), and the keywords used to search for the tweets (‘snake’, ‘snakebite’, ‘venom’) were excluded. The top 5 most common single words (frequency above 150) were ‘gombe’, ‘n8m’, ‘approves’, ‘venomous’ and ‘antivenom’, whilst the top 5 bigrams (frequency above 100) were ‘approves n8m’, ‘gombe approves’, ‘gombe state’, ‘state government’ and ‘government approves’. These all relate to tweets about Gombe State in Nigeria approving N8 million of project funding for research into developing more effective local snake antivenom, a story that was mentioned heavily by a number of users.
By inspecting the unigram and bigram word clouds further, we can also see other words related to snakebite disease such as ‘health’, ‘envenoming’, ‘deadly’, ‘treat’, ‘venomous bites’, ‘ntd’ and ‘nigeria’, as well as more specific incident-related words such as ‘protect owner’ and ‘courageous dog’. There is also evidence of research themes in the corpus through words such as ‘professor working’, ‘carbon monoxide’, ‘funding’ and ‘project’, reflecting the large number of tweets about publications and cutting-edge research into snakebite treatment around the world.
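The word counts behind these charts were produced in R, but the counting step itself is simple enough to sketch in Python, the language of the pipeline from part one. The example below is only an illustration, not the code used for the figures: the function names, the simple regex tokeniser and the two sample tweets are all assumptions, and no stop-word removal or other cleaning is applied. It counts unigrams and bigrams and drops any n-gram containing one of the search keywords.

```python
import re
from collections import Counter

# Keywords used to search for the tweets; excluded from the counts.
SEARCH_KEYWORDS = {"snake", "snakebite", "venom"}


def tokenise(text):
    """Lowercase a tweet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tweets, n=1, exclude=SEARCH_KEYWORDS):
    """Count n-grams across tweets, skipping any that contain an excluded word."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenise(tweet)
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if exclude.isdisjoint(gram):
                counts[" ".join(gram)] += 1
    return counts


# Hypothetical tweets, made up for illustration only.
sample_tweets = [
    "Gombe approves N8m for snake antivenom research",
    "Gombe state government approves N8m snakebite project",
]

unigrams = ngram_counts(sample_tweets, n=1)
bigrams = ngram_counts(sample_tweets, n=2)

print(unigrams.most_common(5))
print(bigrams.most_common(5))
```

Filtering after the n-grams are formed, rather than deleting the keywords from the token list first, avoids creating artificial bigrams out of words that were never adjacent in the tweet.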
The figure below shows the top 10 users who accumulated the most favourites and retweets (with a lower limit of 30 favourites or 30 retweets) during this period. These users came from:
- News organisations or online publishers such as Mashable, New York Times, WebMD (@mashable, @nytimes, @WebMD)
- Global health organisations such as the Liverpool School of Tropical Medicine, Médecins Sans Frontières, Health Action International (@LSTM, @MSFsci, @HAImedicines)
- Global health researchers or policy advisors like Nick Casewell, Peter Hotez, Julien Potet (@nickcasewell, @PeterHotez, @julienpotet)
- Bot or anonymous accounts (@Belxab, @ALT_uscis).
The New York Times was very active during this period and, given its global popularity and online following, it's not surprising that it received almost 750 retweets and favourites. Interestingly, the Spanish bot account @Belxab generated the second-highest number of favourites and retweets (276) from a single tweet, “Efecto del veneno de serpiente sobre la sangre” (which translates as “The effect of snake venom on blood”).
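A ranking like this comes down to summing favourites and retweets per user and keeping the largest totals. The sketch below shows one way to do it in Python. The record layout (`user`, `favourites`, `retweets`), the function name and the sample figures are hypothetical, and it assumes the lower limit of 30 is applied to each user's totals, which may differ from how the chart was actually built.

```python
from collections import defaultdict


def top_users(tweets, min_count=30, top_n=10):
    """Rank users by total favourites plus retweets.

    A user is kept only if their total favourites or their total
    retweets reach min_count (assumed reading of the lower limit).
    """
    totals = defaultdict(lambda: {"favourites": 0, "retweets": 0})
    for tweet in tweets:
        totals[tweet["user"]]["favourites"] += tweet["favourites"]
        totals[tweet["user"]]["retweets"] += tweet["retweets"]

    eligible = [
        (user, counts["favourites"] + counts["retweets"])
        for user, counts in totals.items()
        if counts["favourites"] >= min_count or counts["retweets"] >= min_count
    ]
    # Highest combined engagement first, truncated to the top_n users.
    return sorted(eligible, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical records, made up for illustration only.
sample = [
    {"user": "user_a", "favourites": 40, "retweets": 10},
    {"user": "user_a", "favourites": 5, "retweets": 25},
    {"user": "user_b", "favourites": 12, "retweets": 8},
    {"user": "user_c", "favourites": 100, "retweets": 176},
]

ranking = top_users(sample)
print(ranking)
```

In this made-up sample, `user_c` reaches the top with a single tweet, much as one widely shared tweet can outrank accounts that post far more often.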
In the UK, tweets originated mainly from the North West (Liverpool), Oxford, Cambridge and London. This reflects the high concentration of tropical medicine organisations and researchers in these cities, such as the Liverpool School of Tropical Medicine and Nick Casewell in Liverpool and the Royal Society of Tropical Medicine and Hygiene in London. In Nigeria, tweets from Lagos in the west and Abuja in the central region came mainly from local news accounts. None of the users in the UK and Nigeria were in the list of the top 10 favourited users during the entire 6-month period.
The first version of an interactive visualisation incorporating a number of different charts can be found here. The time slider at the bottom of the map allows the user to filter the data by monthly intervals from July to December (January 2018 was excluded as the dataset only contains tweets for the first three days of that month). By using the play button, the user can quickly visualise the geographic spread of tweet activity for each month. This also updates the bar chart on the bottom right, which displays the top 10 most favourited tweets. The dual-axis plot shows the tweet frequency across the 6-month period alongside the proportion of positive sentiment tweets. This is still a work in progress and I intend to change the design of the visualisation and add further features (this will be updated in my GitHub repository, accessible here).
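The two series in the dual-axis plot can be derived with a simple monthly aggregation. The Python sketch below is illustrative only: it assumes each tweet record carries a `date` and a `sentiment` label (with `"positive"` as one value), and that the window runs from July to December 2017. The field names, the function name and the sample records are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_summary(tweets, start=date(2017, 7, 1), end=date(2017, 12, 31)):
    """Return {month: (tweet_count, positive_share)} for tweets in the window.

    Tweets outside [start, end] are skipped, which drops the partial
    month of January 2018.
    """
    totals, positives = Counter(), Counter()
    for tweet in tweets:
        if not (start <= tweet["date"] <= end):
            continue
        month = tweet["date"].strftime("%Y-%m")
        totals[month] += 1
        if tweet["sentiment"] == "positive":
            positives[month] += 1
    return {m: (totals[m], positives[m] / totals[m]) for m in sorted(totals)}


# Hypothetical records, made up for illustration only.
sample = [
    {"date": date(2017, 7, 4), "sentiment": "positive"},
    {"date": date(2017, 7, 20), "sentiment": "negative"},
    {"date": date(2017, 12, 15), "sentiment": "positive"},
    {"date": date(2018, 1, 2), "sentiment": "positive"},  # outside the window
]

summary = monthly_summary(sample)
print(summary)
```

The tweet count would go on one axis and the positive share on the other; restricting the same records to a single month corresponds to what the time slider does on the map.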
Ryan is a Data Scientist at Manta Ray Media. He has a keen interest in natural language processing, big data and visualisation applications in global health. For more information or questions regarding this blog, please visit the GitHub repository here. | Twitter: @rkn0386