# Labeling

Traditionally: manual labeling. Can be done in-house or externalized.

According to Andrew Ng, when there are only a couple of hundred or a few thousand samples to be labeled, it might be a good idea to have the ML engineer / data scientist do it herself. It's a good opportunity to get to know the data too. Above that level, it becomes expensive for the company and frustrating for the ML engineer.

I can attest to that. When I had to label images of cropped handwriting and add bounding boxes around individual words, 3000 was my limit. I did appreciate the intimacy with the data though. It helped me afterwards when I had to write software to synthesize 300k images from scraped fonts and make them look as similar as possible to real human writing.

Externalized labeling:

* platforms for paid crowd-sourcing: cheaper
* subject matter experts: expensive

Labeling can also be automated:

* semi-supervised learning
* active learning
* weak supervision with Snorkel

### Semi-supervised learning

* a small pool of labeled data
* and a large amount of unlabeled data.

Uses clustering: unlabeled samples are assumed to share structure with the labeled ones, so the existing labels can be propagated to nearby unlabeled points.
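
A minimal sketch of this idea, using scikit-learn's `LabelSpreading` on a toy dataset (the dataset, the number of labeled points, and the hyperparameters are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Toy dataset: 300 points in 3 clusters, only 10 of which keep their labels.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
y_partial = np.full_like(y, -1)                      # -1 marks "unlabeled" in scikit-learn
labeled_idx = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_partial[labeled_idx] = y[labeled_idx]

# LabelSpreading propagates the few known labels to nearby points,
# relying on the cluster structure shared by labeled and unlabeled samples.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

propagated = model.transduction_                     # inferred labels for every sample
```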

### Active learning

* intelligent sampling algorithms, used when it's too expensive or time-consuming to label the whole dataset
* sample data so as to get the most informative data points for training
  * use the subset for training
  * use the subset to do label propagation and extend labeling to more samples (semi-supervised learning)
* **intelligent sampling** techniques:
  * **margin sampling**:

    * start with a few labeled samples
    * train and infer a decision boundary
    * select the closest samples to the boundary to be labeled next
    * label those samples, add them to the labeled pool, and repeat from the training step

    Margin sampling helps achieve some form of training-set information saturation faster. That is, it finds a subset of the training set that is sufficient for training a model with the highest accuracy that can be obtained from that dataset (see the code sketch after this list).

    ![](https://2760274863-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWg18uALijEQIaYx0mKHu%2Fuploads%2FfSjU5Yjk0RQ1cilgv60x%2Fimage.png?alt=media&token=38ed3b0a-4dfc-42ff-85bf-c88d97e2c2ed)
  * **cluster-based sampling**
    * cluster data
    * select a diverse set (to cover all clusters)
  * **query-by-committee**
    * train an ensemble of models
    * choose the samples with most disagreement
  * **region-based sampling**
    * divide the feature space into several regions
    * run one active learning algorithm in each region
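
A rough illustration of one round of the margin sampling loop described above (not any particular library's API; the model choice, `batch_size`, and the seed/pool variable names are assumptions for the sketch):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def margin_sampling_round(model, X_labeled, y_labeled, X_pool, batch_size=10):
    """One round of the loop: fit on what is labeled so far, score the
    unlabeled pool, and return the pool indices closest to the boundary."""
    model.fit(X_labeled, y_labeled)
    proba = model.predict_proba(X_pool)
    top2 = np.sort(proba, axis=1)[:, -2:]            # two highest class probabilities
    margins = top2[:, 1] - top2[:, 0]                # small margin = model is unsure
    return np.argsort(margins)[:batch_size]

# Hypothetical usage: X_seed / y_seed are the few initial labels, X_pool is unlabeled.
# to_label = margin_sampling_round(LogisticRegression(max_iter=1000), X_seed, y_seed, X_pool)
# ...send X_pool[to_label] to annotators, add the new labels, and repeat.
```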

### Weak supervision

* start with unlabeled data
* apply heuristics designed by subject matter experts (SMEs) to generate 'noisy' labels (labels that have a <1 probability of being correct)
* train a generative model to de-noise the labels and assign importance weights to different heuristics
* the de-noised labeled data can then be used as usual

Examples of heuristics:

* emails containing a lot of CAPS are spam
* emails that contain a bunch of different formatting (italic, bold, underline) and different colors (black, blue, red) are spam

Snorkel is the most widely used tool for weak supervision.
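
A hedged sketch of what the heuristics above might look like as Snorkel labeling functions, following the pattern from Snorkel's documentation; `df_train`, the formatting tags checked, and the thresholds are hypothetical placeholders:

```python
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_many_caps(x):
    # Heuristic 1: a high share of upper-case letters suggests spam.
    letters = [c for c in x.text if c.isalpha()]
    ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return SPAM if ratio > 0.5 else ABSTAIN

@labeling_function()
def lf_mixed_formatting(x):
    # Heuristic 2: several different formatting tags in one email suggest spam.
    tags = ("<b>", "<i>", "<u>", "<font")
    return SPAM if sum(tag in x.text for tag in tags) >= 2 else ABSTAIN

# df_train would be a pandas DataFrame with a `text` column of raw emails.
# applier = PandasLFApplier(lfs=[lf_many_caps, lf_mixed_formatting])
# L_train = applier.apply(df_train)            # matrix of noisy votes, one column per heuristic
# label_model = LabelModel(cardinality=2)      # generative model that de-noises the votes
# label_model.fit(L_train=L_train, n_epochs=500)
# probs = label_model.predict_proba(L_train)   # de-noised probabilistic labels
```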
