Labeling

Ways to label datasets

Traditionally: manual labeling. Can be done in-house or externalized.

According to Andrew Ng, when there are only a few hundred or a few thousand samples to label, it might be a good idea for the ML engineer / data scientist to do it herself. It's also a good opportunity to get to know the data. Above that level, though, it becomes expensive for the company and frustrating for the ML engineer.

I can attest to that. When I had to label images of cropped handwriting and add bounding boxes around individual words, 3,000 was my limit. I did appreciate the intimacy with the data, though. It helped me later when I had to write software to synthesize 300k images from scraped fonts and make them look as similar as possible to real human writing.

Externalized labeling:

  • platforms for paid crowd-sourcing: cheaper

  • subject matter experts: expensive

Labeling can also be automated:

  • semi-supervised learning

  • active learning

  • weak supervision (e.g. with Snorkel)

Semi-supervised learning

  • a small pool of labeled data

  • and a large amount of unlabeled data.

Exploits the structure of the data (labeled and unlabeled samples are assumed to share some commonalities, e.g. to form clusters) to extend the known labels to the unlabeled samples.
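
A minimal sketch of this idea using scikit-learn's LabelSpreading, one common graph-based implementation. The toy two-moons dataset and the number of initially labeled points are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Toy dataset: 300 points, only 10 of which carry labels; the rest are
# marked -1, which scikit-learn's semi-supervised estimators treat as "unlabeled".
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.full_like(y_true, fill_value=-1)
labeled_idx = np.random.RandomState(0).choice(len(X), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# LabelSpreading builds a similarity graph over all points and propagates
# the few known labels along it, exploiting the cluster structure.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)

propagated = model.transduction_        # inferred labels for every sample
print((propagated == y_true).mean())    # agreement with the hidden ground truth
```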

Active learning

  • intelligent sampling algorithms, used when it's too expensive or time-consuming to label the whole dataset

  • sample data so as to get the most informative data points for training

    • use the subset for training

    • use the subset to do label propagation to extend labeling to more samples (semi-supervised learning)

  • intelligent sampling techniques:

    • margin sampling:

      • start with a few labeled samples

      • train and infer a decision boundary

      • select the closest samples to the boundary to be labeled next

      • repeat from step 1

      Margin sampling helps achieve some form of training-set information saturation faster. That is, it finds a subset of the training set that is sufficient for training a model with the highest accuracy obtainable from that dataset (see the sketch after this list).

    • cluster-based sampling

      • cluster data

      • select a diverse set (to cover all clusters)

    • query-by-committee

      • train an ensemble of models

      • choose the samples with most disagreement

    • region-based sampling

      • divide the feature space into several regions

      • run one active learning algorithm in each region
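
A minimal sketch of the margin-sampling loop described above, using scikit-learn. The synthetic dataset, pool sizes, and number of rounds are illustrative choices; in practice the "annotator" step would be a human labeling the queried samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=20, replace=False))   # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):
    # 1. train on the currently labeled subset; the model implies a decision boundary
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # 2. margin = gap between the two most probable classes;
    #    a small margin means the sample lies close to the boundary
    proba = clf.predict_proba(X[pool])
    top_two = np.sort(proba, axis=1)[:, -2:]
    margins = top_two[:, 1] - top_two[:, 0]

    # 3. "send" the narrowest-margin samples to the annotator
    #    (here we simply reveal their true labels) and repeat
    query = np.argsort(margins)[:20]
    for i in sorted(query.tolist(), reverse=True):
        labeled.append(pool.pop(i))

print(f"labeled {len(labeled)} of {len(X)} samples")
```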

Weak supervision

  • start with unlabeled data

  • apply heuristics designed by subject matter experts (SMEs) to generate 'noisy' labels (labels that have a <1 probability of being correct)

  • train a generative model to de-noise the labels and assign importance weights to different heuristics

  • the de-noised labeled data can then be used as usual

Example of heuristics:

  • emails containing a lot of CAPS are spam

  • emails that mix several kinds of formatting (italic, bold, underline) and colors (black, blue, red) are spam

Snorkel is the most widely used tool for weak supervision.
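
A minimal sketch of the two spam heuristics above written as Snorkel labeling functions, assuming the Snorkel 0.9.x labeling API and a pandas DataFrame with a `text` column; the tiny DataFrame exists only to make the example runnable.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, HAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_many_caps(x):
    # Heuristic: emails shouting in CAPS are likely spam.
    letters = [c for c in x.text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        return SPAM
    return ABSTAIN

@labeling_function()
def lf_mixed_formatting(x):
    # Heuristic: heavy use of mixed formatting / color markup suggests spam.
    tags = ("<b>", "<i>", "<u>", "color=")
    return SPAM if sum(t in x.text for t in tags) >= 2 else ABSTAIN

df_train = pd.DataFrame({"text": ["BUY NOW!!! FREE <b>OFFER</b> color=red",
                                  "Hi, are we still meeting tomorrow?"]})

# Apply every labeling function to every sample -> a matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_many_caps, lf_mixed_formatting])
L_train = applier.apply(df_train)

# The generative LabelModel learns how much to trust each heuristic
# and outputs a single de-noised label per sample.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
print(label_model.predict(L_train))
```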
