
Labeling

Ways to label datasets


Traditionally: manual labeling. Can be done in-house or externalized.

According to Andrew Ng, when there are only a few hundred to a few thousand samples to label, it might be a good idea for the ML engineer / data scientist to do it herself. It's a good opportunity to get to know the data, too. Above that level, though, labeling becomes expensive for the company and frustrating for the ML engineer.

I can attest to that. When I had to label images of cropped handwriting and add bounding boxes around individual words, 3,000 was my limit. I did appreciate the intimacy with the data, though. It helped me afterwards, when I had to write software to synthesize 300k images from scraped fonts and make them look as similar as possible to real human writing.

Externalized labeling:

  • platforms for paid crowd-sourcing: cheaper

  • subject matter experts: expensive

Labeling can also be automated:

  • semi-supervised learning

  • active learning

  • weak supervision with Snorkel

Semi-supervised learning

  • a small pool of labeled data

  • and a large amount of unlabeled data.

Uses clustering (the labeled and unlabeled data are assumed to share some structure) to extend the known labels to the unlabeled samples.
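
A minimal sketch of this idea using scikit-learn's LabelSpreading (the toy dataset, kernel and hyper-parameters are illustrative assumptions, not part of these notes):

```python
# Semi-supervised label propagation sketch with scikit-learn.
# Unlabeled points are marked with -1, as sklearn.semi_supervised expects.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_blobs(n_samples=200, centers=3, random_state=0)

# Hide most labels: keep only 10 labeled points, mark the rest -1.
rng = np.random.default_rng(0)
y = np.full(len(y_true), -1)
labeled_idx = rng.choice(len(y_true), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# Spread the few known labels through the graph of nearby points.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)

mask = y == -1  # points that started out unlabeled
print("accuracy on initially unlabeled points:",
      (model.transduction_[mask] == y_true[mask]).mean())
```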

Active learning

  • intelligent sampling algorithms, for when it's too expensive or time-consuming to label the whole dataset

  • sample data so as to get the most informative data points for training

    • use the subset for training

    • use the subset to do label propagation to extend labeling to more samples (semi-supervised learning)

  • intelligent sampling techniques:

    • margin sampling:

      • start with a few labeled samples

      • train and infer a decision boundary

      • select the closest samples to the boundary to be labeled next

      • repeat from step 1

      Margin sampling helps achieve some form of training-set information saturation faster. That is, it finds a subset of the training set that is sufficient for training a model with the highest accuracy that can be obtained from that dataset. (A minimal sketch of this loop follows the list below.)

    • cluster-based sampling

      • cluster data

      • select a diverse set to label (covering all clusters; see the sketch after this list)

    • query-by-committee

      • train an ensemble of models

      • choose the samples where the models disagree most (see the sketch after this list)

    • region-based sampling

      • divide the feature space into several regions

      • run one active learning algorithm in each region
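
Margin sampling, sketched as a minimal loop (the dataset, model and batch size are illustrative assumptions; the known true labels stand in for a human annotator):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# 1. start with a few labeled samples
labeled = list(rng.choice(len(X), size=10, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):
    # 2. train and infer a decision boundary
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # 3. margin = gap between the two most probable classes;
    #    a small margin means the sample sits close to the boundary
    proba = model.predict_proba(X[unlabeled])
    top2 = np.sort(proba, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]

    # 4. send the lowest-margin samples to the annotator
    #    (here the known y stands in for a human labeler), then repeat
    query = np.argsort(margins)[:10]
    new = [unlabeled[i] for i in query]
    labeled += new
    unlabeled = [i for i in unlabeled if i not in new]
```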
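
Cluster-based sampling, sketched with k-means (the cluster count and dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# Cluster the unlabeled pool, then pick one representative per cluster
# (the sample nearest each centroid) so all clusters are covered.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
to_label = pairwise_distances_argmin(kmeans.cluster_centers_, X)
print("indices to send for labeling:", to_label)
```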
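
Query-by-committee, sketched with a three-model committee and vote entropy as the disagreement measure (the models and data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
labeled = np.arange(20)           # pretend the first 20 are labeled
pool = np.arange(20, len(X))      # the unlabeled pool

# Train an ensemble (the 'committee') on the labeled set and collect
# each member's vote on every sample in the pool.
committee = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(random_state=0),
             KNeighborsClassifier()]
votes = np.stack([m.fit(X[labeled], y[labeled]).predict(X[pool])
                  for m in committee])

# Vote entropy per sample: 0 = unanimous, higher = more disagreement.
counts = np.stack([(votes == c).sum(axis=0) for c in (0, 1)], axis=1)
disagreement = entropy(counts.T)  # entropy() normalizes the counts

query = pool[np.argsort(disagreement)[-10:]]  # 10 most contested samples
```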

Weak supervision

  • start with unlabeled data

  • apply heuristics designed by subject matter experts (SMEs) to generate 'noisy' labels (labels with a probability < 1 of being correct)

  • train a generative model to de-noise the labels and assign importance weights to different heuristics

  • the de-noised labeled data can then be used as usual

Examples of heuristics:

  • emails containing a lot of CAPS are spam

  • emails that contain a bunch of different formatting (italic, bold, underline) and different colors (black, blue, red) are spam

Snorkel is the most widely used tool for weak supervision.
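
A minimal sketch of this flow with Snorkel's labeling-function API, wiring up the two spam heuristics above (the DataFrame, column name and thresholds are illustrative assumptions):

```python
# Weak supervision sketch with Snorkel: write heuristics as labeling
# functions, apply them, then let the LabelModel denoise their votes.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_mostly_caps(x):
    # Heuristic: emails containing a lot of CAPS are spam.
    letters = [c for c in x.text if c.isalpha()]
    caps = sum(c.isupper() for c in letters)
    return SPAM if letters and caps / len(letters) > 0.5 else ABSTAIN

@labeling_function()
def lf_heavy_formatting(x):
    # Heuristic: a mix of many formatting tags / colors suggests spam.
    markers = ("<i>", "<b>", "<u>", "color=")
    return SPAM if sum(m in x.text for m in markers) >= 3 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "WIN A FREE PRIZE NOW",
    "Minutes from today's standup",
    "<b><i><u>HOT</u></i></b> <font color=red>deal</font>",
]})

# Rows x labeling-functions matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_mostly_caps, lf_heavy_formatting])
L_train = applier.apply(df=df_train)

# Generative model: denoises the votes and weights each heuristic.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=0)
df_train["label"] = label_model.predict(L_train)
```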
