Mihaela Grigore

Data scraping | Social Media Scraping: Twitter Developer API for Academics


Last updated 3 years ago

Social Media Scraping

See the project repository on GitHub.

Scraping historical tweets with Twitter API v2

I conducted a research study of how different stakeholders of the education system interact with and perceive the international assessment PISA (Programme for International Student Assessment). More precisely, I set out to answer these research questions:

Research Question 1. How is the discussion around the PISA examination represented in Twitter discourse, and how did the discourse evolve over time?

Research Question 2. What is the demographic profile of the participants in this online conversation?

To answer these questions, I turned to social media, where all categories of stakeholders (students, teachers, parents, institutions, etc.) voice opinions and where information is freely available.

I needed to scrape all historical tweets while filtering for specific keywords I identified as being relevant.

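The first tool I tested was snscrape (this is its GitHub repo). I wrote another article on how to scrape historical tweets without the need of a Twitter developer account using snscrape.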

Snscrape was easy to get started with (easy to use, good documentation, plenty of examples), but after performing the data collection I thought I got too few tweets. I can't say for sure how many tweets is the right amount, but I reasoned as follows: if I only get 5,000 tweets for 2012 related to PISA test results, then at most 5,000 people tweeted globally using my keywords. This seems like a low number, given that PISA is the most widely known and discussed of all the ILSAs (International Large Scale Assessments). To get an idea of its magnitude: in 2018 there were 79 participating countries.

Starting from this reasoning, and considering that I found no resource that would estimate the percentage of results collected by snscrape from the full archive of historical tweets, I decided to look into Twitter's new Academic Research product track.

This repository contains one Jupyter Notebook where you can find the sections below. If those are something you're interested in, explore the notebook. Also, feel free to get in touch if you have specific questions about this work / need help with your own (research) project.

  1. Getting started with Academic Research product track

  2. Pagination of results for historical tweets collection

  3. Tweet fields

  4. Scraping rate limits

  5. Retweeted tweets and truncation

  6. Twitter API v2 Expansions

  7. Mass collection of historical tweets for multiple keywords

  8. Load tweets from the CSV file into pandas DataFrame for analysis

  9. Removing duplicate tweets

  10. User location from Twitter data

  11. Tweets preliminary data analysis

  12. What's next? Sentiment analysis
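To give a flavour of sections 2, 4 and 9 above, here is a minimal sketch of paginated collection followed by duplicate removal. It is not the code from the notebook. It assumes the Twitter API v2 full-archive search endpoint, which returns a `next_token` in the `meta` object while more pages remain. The function names (`fetch_page_via_requests`, `iter_tweets`, `drop_duplicates`), the chosen tweet fields and the one-second pause between requests are my own illustrative assumptions; check the current rate limits before relying on them.

```python
import time
from typing import Callable, Dict, Iterator, List

# Full-archive search endpoint of Twitter API v2 (Academic Research access).
SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"


def fetch_page_via_requests(params: Dict[str, str], bearer_token: str) -> dict:
    """Fetch one page of results (hypothetical helper; needs `requests` and a valid token)."""
    import requests  # imported lazily so the rest of the sketch runs without it

    response = requests.get(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {bearer_token}"},
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def iter_tweets(
    fetch_page: Callable[[Dict[str, str]], dict],
    query: str,
    start_time: str,
    end_time: str,
    pause_seconds: float = 1.0,  # assumed pause to stay under the rate limit
) -> Iterator[dict]:
    """Yield tweets page by page, following `meta.next_token` until it disappears."""
    params = {
        "query": query,
        "start_time": start_time,  # ISO 8601, e.g. "2012-01-01T00:00:00Z"
        "end_time": end_time,
        "max_results": "500",
        "tweet.fields": "created_at,lang,author_id",  # illustrative selection
    }
    while True:
        page = fetch_page(dict(params))
        # A page with no matches has no "data" key at all.
        yield from page.get("data", [])
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break  # last page reached
        params["next_token"] = next_token
        time.sleep(pause_seconds)


def drop_duplicates(tweets: List[dict]) -> List[dict]:
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            unique.append(tweet)
    return unique
```

With a real bearer token, the pieces could be wired together as `list(iter_tweets(lambda p: fetch_page_via_requests(p, BEARER_TOKEN), "PISA", "2012-01-01T00:00:00Z", "2013-01-01T00:00:00Z"))`, where `BEARER_TOKEN` and the query are placeholders. Passing the page fetcher in as a callable keeps the pagination loop testable without network access. Duplicates typically appear when several keyword queries match the same tweet, which is why the de-duplication step runs on the combined results.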

Libraries needed:

requests
nltk
seaborn
geotext
geograpy3
pycountry
CountryInfo