Labeling
Ways to label datasets
Traditionally: manual labeling. Can be done in-house or externalized.
According to Andrew Ng, when there are only a few hundred or a few thousand samples to label, it can be a good idea for the ML engineer / data scientist to do it herself. It's also a good opportunity to get to know the data. Above that level, though, it becomes expensive for the company and frustrating for the ML engineer.
I can attest to that. When I had to label images of cropped handwriting and add bounding boxes around individual words, 3,000 was my limit. I did appreciate the intimacy with the data, though: it helped me later when I had to write software to synthesize 300k images from scraped fonts and make them look as similar as possible to real human writing.
Externalized labeling:
- platforms for paid crowd-sourcing: cheaper
- subject matter experts: expensive

Labeling can also be automated:
- semi-supervised learning
- active learning
- weak supervision with Snorkel
Semi-supervised learning starts from a small pool of labeled data and a large amount of unlabeled data. It uses clustering of the labeled data (assumed to share some structural commonalities with the unlabeled data) to extend the labels to the new samples.
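A minimal sketch of this idea, assuming scikit-learn's LabelSpreading (the dataset, kernel, and size of the labeled pool are illustrative):

```python
# Semi-supervised label propagation: most labels are hidden (-1 = unlabeled);
# LabelSpreading extends the few known labels through the data's structure.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

y_partial = np.full_like(y, -1)          # -1 marks unlabeled samples
labeled_idx = rng.choice(len(y), size=15, replace=False)
y_partial[labeled_idx] = y[labeled_idx]  # keep only a small labeled pool

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

propagated = model.transduction_         # labels extended to all samples
accuracy = (propagated == y).mean()
```

The `transduction_` attribute holds the propagated label for every sample, including the ones that started out unlabeled.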
Active learning uses intelligent sampling algorithms when it's too expensive or time-consuming to label the whole dataset:
sample the data so as to get the most informative data points for training
use the subset for training
use the subset to do label propagation to extend the labels to more samples (semi-supervised learning)
intelligent sampling techniques:
margin sampling:
1. start with a few labeled samples
2. train and infer a decision boundary
3. select the samples closest to the boundary to be labeled next
4. repeat from step 2 with the newly labeled samples
Margin sampling helps achieve some form of training-set information saturation faster. That is, it finds a subset of the training set that is sufficient for training a model with the highest accuracy that can be obtained from that dataset.
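The margin-sampling loop could look like this with a linear model. This is a sketch: the synthetic dataset, seed-set construction, query size, and number of rounds are all arbitrary choices.

```python
# Margin sampling: train on a seed set, then query the unlabeled points whose
# top-two class probabilities are closest, i.e. nearest the decision boundary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# seed set: a few labeled examples of each class (simulating initial labeling)
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
unlabeled = [i for i in range(len(y)) if i not in labeled]

for _ in range(3):                            # a few active-learning rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[unlabeled])
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]          # small margin = uncertain point
    query = [unlabeled[i] for i in np.argsort(margin)[:20]]
    labeled += query                          # the oracle labels these next
    unlabeled = [i for i in unlabeled if i not in query]
```

Each round, the 20 most ambiguous points are sent for labeling and folded back into the training set.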
cluster-based sampling
cluster data
select a diverse set (to cover all clusters)
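A possible sketch with k-means (the cluster count and synthetic dataset are illustrative): labeling the point nearest each centroid gives a diverse seed set that covers all clusters.

```python
# Cluster-based sampling: cluster the unlabeled data, then pick one
# representative per cluster (the sample closest to each centroid).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# one representative per cluster: the point nearest its centroid
picks = [np.argmin(np.linalg.norm(X - c, axis=1)) for c in km.cluster_centers_]
```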
query-by-committee
train an ensemble of models
choose the samples with the most disagreement
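A sketch of query-by-committee using bootstrap-trained trees (the committee size, model choice, and disagreement measure are illustrative; vote entropy is another common choice):

```python
# Query-by-committee: train a committee of models on bootstrap samples and
# query the points they disagree on most.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)
rng = np.random.default_rng(1)

committee = []
for _ in range(5):                          # bootstrap-trained committee
    idx = rng.choice(len(y), size=len(y), replace=True)
    committee.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in committee])   # shape (5, 400)
disagreement = votes.std(axis=0)            # spread of the binary votes
query = np.argsort(disagreement)[-10:]      # 10 most-contested points to label
```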
region-based sampling
divide the feature space into several regions
run one active learning algorithm in each region
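A toy sketch of region-based sampling: here the feature space is partitioned by the median of one feature (a hypothetical choice), and a margin-style query runs independently in each region. The seed construction is also an assumption for illustration.

```python
# Region-based sampling: split the feature space into regions and run one
# uncertainty-based active learner per region.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=4, random_state=2)
split = np.median(X[:, 0])                  # partition on one feature's median
regions = [np.where(X[:, 0] <= split)[0], np.where(X[:, 0] > split)[0]]

queries = []
for idx in regions:                         # one active learner per region
    # hypothetical seed: a few labeled examples of each class in this region
    seed = np.concatenate([idx[y[idx] == 0][:10], idx[y[idx] == 1][:10]])
    clf = LogisticRegression().fit(X[seed], y[seed])
    proba = clf.predict_proba(X[idx])
    margin = np.abs(proba[:, 1] - proba[:, 0])        # small = near boundary
    queries.extend(idx[np.argsort(margin)[:5]].tolist())  # 5 queries/region
```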
Weak supervision:
start with unlabeled data
apply heuristics designed by subject matter experts (SMEs) to generate 'noisy' labels (labels that have a <1 probability of being correct)
train a generative model to de-noise the labels and assign importance weights to the different heuristics
the de-noised labeled data can then be used as usual
Examples of heuristics:
emails containing a lot of CAPS are spam
emails that contain a bunch of different formatting (italic, bold, underline) and different colors (black, blue, red) are spam
Snorkel is the most widely used tool for weak supervision.
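The labeling-function pattern can be sketched in plain Python. Note this is a toy stand-in: a simple majority vote replaces Snorkel's generative LabelModel, which would instead learn a weight per heuristic from their agreements and conflicts. The heuristics and thresholds below are illustrative assumptions.

```python
# Toy weak supervision: each heuristic votes SPAM / HAM or abstains, and the
# non-abstaining votes are combined (here, by majority vote).
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_many_caps(email):
    # heuristic: mostly-uppercase emails are spam (threshold is an assumption)
    letters = [c for c in email if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        return SPAM
    return ABSTAIN

def lf_heavy_formatting(email):
    # heuristic: mixing several formatting tags suggests spam
    tags = ("<b>", "<i>", "<u>")
    return SPAM if sum(t in email for t in tags) >= 2 else ABSTAIN

def lf_greeting(email):
    # heuristic: a personal greeting suggests a legitimate email
    return HAM if email.lower().startswith(("hi ", "dear ")) else ABSTAIN

LFS = [lf_many_caps, lf_heavy_formatting, lf_greeting]

def noisy_label(email):
    votes = [lf(email) for lf in LFS if lf(email) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)   # majority over non-abstains

label = noisy_label("BUY <b>NOW</b> <u>CHEAP</u> PILLS")
```

In Snorkel itself, the heuristics would be decorated with `@labeling_function`, applied with an `LFApplier`, and de-noised with a `LabelModel`.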