Labeling

Ways to label datasets

Traditionally: manual labeling. Can be done in-house or externalized.

According to Andrew Ng, when there are only a few hundred or a few thousand samples to label, it might be a good idea for the ML engineer / data scientist to do it herself. It's also a good opportunity to get to know the data. Above that level, though, it becomes expensive for the company and frustrating for the ML engineer.

I can attest to that. When I had to label images of cropped handwriting and add bounding boxes around individual words, 3,000 was my limit. I did appreciate the intimacy with the data, though. It helped me later when I had to write software to synthesize 300k images from scraped fonts and make them look as similar as possible to real human writing.

Externalized labeling:

  • platforms for paid crowd-sourcing: cheaper

  • subject matter experts: expensive

Labeling can also be automated:

  • semi-supervised learning

  • active learning

  • weak supervision (e.g. with Snorkel)

Semi-supervised learning

  • a small pool of labeled data

  • and a large amount of unlabeled data.

Exploits the structure of the data (labeled and unlabeled samples are assumed to share some commonalities, e.g. to form clusters) to extend the known labels to the unlabeled samples.
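
A minimal sketch of this idea using scikit-learn's LabelSpreading, one common graph-based implementation. The toy two-moons dataset and the number of initially labeled points are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Toy dataset: 300 points, only 10 of which carry labels; the rest are
# marked -1, which scikit-learn's semi-supervised estimators treat as "unlabeled".
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.full_like(y_true, fill_value=-1)
labeled_idx = np.random.RandomState(0).choice(len(X), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# LabelSpreading builds a similarity graph over all points and propagates
# the few known labels along it, exploiting the cluster structure.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)

propagated = model.transduction_        # inferred labels for every sample
print((propagated == y_true).mean())    # agreement with the hidden ground truth
```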

Active learning

  • intelligent sampling algorithms, used when it's too expensive or time-consuming to label the whole dataset

  • sample data so as to get the most informative data points for training

    • use the subset for training

    • use the subset to do label propagation to extend labeling to more samples (semi-supervised learning)

  • intelligent sampling techniques:

    • margin sampling:

      • start with a few labeled samples

      • train and infer a decision boundary

      • select the closest samples to the boundary to be labeled next

      • repeat from step 1

      Margin sampling helps achieve some form of training-set information saturation faster. That is, it finds a subset of the training set that is sufficient for training a model with the highest accuracy obtainable from that dataset (see the sketch after this list).

    • cluster-based sampling

      • cluster data

      • select a diverse set (to cover all clusters)

    • query-by-committee

      • train an ensemble of models

      • choose the samples with most disagreement

    • region-based sampling

      • divide the feature space into several regions

      • run one active learning algorithm in each region
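
A minimal sketch of the margin-sampling loop described above, using scikit-learn. The synthetic dataset, pool sizes, and number of rounds are illustrative choices; in practice the "annotator" step would be a human labeling the queried samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=20, replace=False))   # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):
    # 1. train on the currently labeled subset; the model implies a decision boundary
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # 2. margin = gap between the two most probable classes;
    #    a small margin means the sample lies close to the boundary
    proba = clf.predict_proba(X[pool])
    top_two = np.sort(proba, axis=1)[:, -2:]
    margins = top_two[:, 1] - top_two[:, 0]

    # 3. "send" the narrowest-margin samples to the annotator
    #    (here we simply reveal their true labels) and repeat
    query = np.argsort(margins)[:20]
    for i in sorted(query.tolist(), reverse=True):
        labeled.append(pool.pop(i))

print(f"labeled {len(labeled)} of {len(X)} samples")
```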

Weak supervision

  • start with unlabeled data

  • apply heuristics designed by subject matter experts (SMEs) to generate 'noisy' labels (labels that have a <1 probability of being correct)

  • train a generative model to de-noise the labels and assign importance weights to different heuristics

  • the de-noised labeled data can then be used as usual

Example of heuristics:

  • emails containing a lot of CAPS are spam

  • emails that mix several kinds of formatting (italic, bold, underline) and colors (black, blue, red) are spam

Snorkel is the most widely used tool for weak supervision.
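
A minimal sketch of the two spam heuristics above written as Snorkel labeling functions, assuming the Snorkel 0.9.x labeling API and a pandas DataFrame with a `text` column; the tiny DataFrame exists only to make the example runnable.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, HAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_many_caps(x):
    # Heuristic: emails shouting in CAPS are likely spam.
    letters = [c for c in x.text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        return SPAM
    return ABSTAIN

@labeling_function()
def lf_mixed_formatting(x):
    # Heuristic: heavy use of mixed formatting / color markup suggests spam.
    tags = ("<b>", "<i>", "<u>", "color=")
    return SPAM if sum(t in x.text for t in tags) >= 2 else ABSTAIN

df_train = pd.DataFrame({"text": ["BUY NOW!!! FREE <b>OFFER</b> color=red",
                                  "Hi, are we still meeting tomorrow?"]})

# Apply every labeling function to every sample -> a matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_many_caps, lf_mixed_formatting])
L_train = applier.apply(df_train)

# The generative LabelModel learns how much to trust each heuristic
# and outputs a single de-noised label per sample.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
print(label_model.predict(L_train))
```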
