Data scraping | Social Media Scraping: Twitter Developer API for Academics
See project repository on GitHub.
Scraping historical tweets with Twitter API v2
I conducted a research study on how different stakeholders of the education system interact with and perceive the international assessment PISA (Programme for International Student Assessment). More precisely, I set out to answer these research questions:
Research Question 1. How is the discussion around the PISA examination represented in Twitter discourse, and how did the discourse evolve over time?
Research Question 2. What is the demographic profile of the participants in this online conversation?
To answer these questions, I turned to social media, where all categories of stakeholders (students, teachers, parents, institutions, etc.) voice opinions and where information is freely available.
I needed to scrape all historical tweets while filtering for specific keywords I identified as being relevant.
The first tool I tested was snscrape. I wrote another article on how to scrape historical tweets with snscrape, without the need for a Twitter developer account, and this is its GitHub repo.
Snscrape was easy to get started with (easy to use, good documentation, plenty of examples), but after performing the data collection I thought I got too few tweets. I can't say for sure how many tweets is the right amount, but I reasoned as follows: if I only get 5,000 tweets related to PISA test results for 2012, then at most 5,000 people tweeted globally using my keywords. This seems like a low number, given that PISA is the most widely known and discussed of all the ILSAs (International Large-Scale Assessments). To get an idea of its magnitude: in 2018 there were 79 participating countries.
Starting from this reasoning and considering that I found no resource that would estimate the percentage of results collected by snscrape from the full archive of historical tweets, I decided to look into Twitter's new Academic Research product track.
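To make the idea of a keyword-filtered historical search concrete, here is a minimal sketch of how the request parameters for the Twitter API v2 full-archive search endpoint could be assembled. The helper names (`build_query`, `build_params`), the example keywords, and the selected tweet fields are assumptions made for illustration; they are not taken from the notebook.

```python
from datetime import datetime, timezone

# Full-archive search endpoint of Twitter API v2 (Academic Research access).
SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"


def build_query(keywords):
    """Join keywords with OR; wrap multi-word phrases in double quotes."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return "(" + " OR ".join(terms) + ")"


def build_params(keywords, start, end, max_results=500):
    """Return the query-string parameters for one search request.

    `start` and `end` are timezone-aware datetimes; the API expects
    ISO 8601 timestamps in UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "query": build_query(keywords),
        "start_time": start.astimezone(timezone.utc).strftime(fmt),
        "end_time": end.astimezone(timezone.utc).strftime(fmt),
        "max_results": max_results,  # full-archive search accepts 10 to 500
        # Hypothetical field selection; adjust to the analysis at hand.
        "tweet.fields": "id,text,created_at,author_id,lang",
    }


# Hypothetical keywords, used only to show the shape of the output.
params = build_params(
    ["#PISA2018", "PISA results"],
    datetime(2012, 1, 1, tzinfo=timezone.utc),
    datetime(2013, 1, 1, tzinfo=timezone.utc),
)
print(SEARCH_URL, params)
```

These parameters would be sent with a GET request that carries the bearer token of an approved developer account in the `Authorization` header.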
This repository contains one Jupyter Notebook where you can find the sections below. If any of them interest you, explore the notebook. Also, feel free to get in touch if you have specific questions about this work or need help with your own (research) project.
Getting started with Academic Research product track
Pagination of results for historical tweets collection
Tweet fields
Scraping rate limits
Retweeted tweets and truncation
Twitter API v2 Expansions
Mass collection of historical tweets for multiple keywords
Load tweets from the CSV file into pandas DataFrame for analysis
Removing duplicate tweets
User location from Twitter data
Tweets preliminary data analysis
What's next? Sentiment analysis
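Two of the sections listed above, pagination and duplicate removal, can be sketched roughly as follows. This is not the notebook's code: `iter_tweets`, `drop_duplicates` and the `fetch_page` callable are hypothetical names, and the one-second pause is an assumption meant to stay within the per-second request limit of full-archive search. The sketch relies only on the response layout of Twitter API v2 search, where tweets arrive under `data` and the token for the next page under `meta.next_token`.

```python
import time
from typing import Callable, Iterator, List, Optional


def iter_tweets(
    fetch_page: Callable[[Optional[str]], dict],
    pause_seconds: float = 1.0,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield tweets page by page until no next_token is returned.

    `fetch_page(next_token)` is a caller-supplied function that performs
    one request and returns the decoded JSON response.
    """
    next_token = None
    pages = 0
    while True:
        page = fetch_page(next_token)
        # A page with no matches has no "data" key at all.
        yield from page.get("data", [])
        pages += 1
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break
        if max_pages is not None and pages >= max_pages:
            break
        time.sleep(pause_seconds)  # stay under the request rate limit


def drop_duplicates(tweets: List[dict]) -> List[dict]:
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            unique.append(tweet)
    return unique


# Fake two-page response, standing in for real API calls.
fake_pages = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "2"}, {"id": "3"}], "meta": {}},
}
tweets = list(iter_tweets(lambda token: fake_pages[token], pause_seconds=0))
unique_tweets = drop_duplicates(tweets)
print(len(tweets), len(unique_tweets))
```

Passing the request function in as an argument keeps the paging logic independent of the HTTP client, so it can be tried out with canned responses before spending any of the request quota. Duplicates typically appear when several keyword searches return the same tweet, which is why the filter works on the tweet id rather than the text.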