Feature selection

Considerations relating to productionizing machine learning models

Selecting the most relevant features increases the signal-to-noise ratio, which makes models converge faster and/or leads to better predictions.

But from the perspective of productionizing ML models, it also means:

  • lower storage and I/O requirements

  • lower inference cost

Both considerations above matter in production, where we might be serving millions of requests or dealing with a high-dimensional feature space.

Quick recap of methods for feature selection:

  1. Unsupervised - doesn't look at labels (that is, it's agnostic of the relationship between features and the target label); removes highly correlated features to reduce redundancy.

  2. Supervised:

    2.1 Filter methods - correlations and univariate feature selection.

    2.2 Wrapper methods - sequentially add or remove features (retraining the model at each step) until no further improvement is made.

    2.3 Embedded methods - some models already contain a feature importance computation mechanism; RandomForestClassifier from scikit-learn has this, and we can piggyback on it to select the best features (see the sketch below).
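
As a quick sketch of the embedded approach (using scikit-learn's built-in copy of the breast cancer data so the snippet is self-contained; the "mean importance" threshold is an assumption for illustration), the feature_importances_ of a fitted RandomForestClassifier can drive SelectFromModel:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Labeled dataset with 30 numeric features
X, y = load_breast_cancer(return_X_y=True)

# Fit a forest and keep only the features whose importance is above
# the mean importance (threshold choice is an assumption, not a rule)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='mean',
)
X_reduced = selector.fit_transform(X, y)

print(X.shape, '->', X_reduced.shape)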

Statistical tests for comparing features

Looking for correlations:

  • Pearson's correlation (linear relationships)

  • Kendall's Tau Rank Correlation Coefficient (monotonic relationships)

  • Spearman's Rank Correlation Coefficient (monotonic relationships)
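
As a minimal sketch (assuming scipy is available, and using scikit-learn's built-in copy of the breast cancer data for self-containment), all three coefficients can be computed between one feature and the target:

from scipy import stats
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
feature = X[:, 0]  # first feature column (mean radius)

# Each call returns (coefficient, p-value)
pearson_r, _ = stats.pearsonr(feature, y)
kendall_tau, _ = stats.kendalltau(feature, y)
spearman_rho, _ = stats.spearmanr(feature, y)

print(pearson_r, kendall_tau, spearman_rho)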

Other methods

  • mutual information

  • F-test

  • Chi-Squared test
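
All three are available as univariate scoring functions in scikit-learn. A minimal sketch, again on the built-in copy of the breast cancer data (note that chi2 requires non-negative feature values, which holds here):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, f_classif, chi2

X, y = load_breast_cancer(return_X_y=True)

mi_scores = mutual_info_classif(X, y, random_state=42)  # mutual information per feature
f_scores, f_pvalues = f_classif(X, y)                   # ANOVA F-test per feature
chi2_scores, chi2_pvalues = chi2(X, y)                  # chi-squared test (non-negative X only)

print(mi_scores.shape, f_scores.shape, chi2_scores.shape)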

Effect of reducing the feature space based on correlations

I experimented with the Breast Cancer Dataset.

import pandas as pd

# Load data
df = pd.read_csv('breast_cancer_data.csv')

# Data preview
df.head()

There are 33 features in the initial dataset. A few of them are obviously extraneous and easy to spot (the patient id, etc.). Still, the 30 remaining features seem relevant at first sight.

Here is an interesting test:

  • Train a Random Forest Classifier on the 30-feature dataset.

  • Keep only the features 'highly correlated' with the target (the diagnostic label, in this case - the last row in the matrix below). 'Highly correlated' means that the correlation coefficient is above a certain threshold, which is not a value set in stone. Retrain a Random Forest Classifier. This would be supervised filtering (since we're looking at relationships between features and the dependent variable).

  • Compute the pairwise correlations between features and remove features with a correlation coefficient above a threshold. Two highly correlated features mean redundant information for the prediction model. Train another Random Forest Classifier. This would be unsupervised feature selection, since we're not taking the target value into account.

  • Compare the results of the three models (a rough code sketch of this workflow follows the heatmap note below).

# Pairwise correlation matrix, visualized below as a heatmap
cor = df.corr()

I turned off the annotations for this plot (the actual correlation values) because there wasn't enough space to display such a large image. The darker the square, the higher the correlation between the two features.
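
Here is a rough sketch of that three-model comparison. The column names ('id', 'diagnosis'), the two thresholds and the accuracy-only evaluation are assumptions for illustration, not the exact code behind the results below:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast_cancer_data.csv')

# Assumed preprocessing: drop the id column and any fully empty columns,
# then encode the diagnosis label as 0/1
df = df.drop(columns=['id']).dropna(axis=1, how='all')
df['diagnosis'] = (df['diagnosis'] == 'M').astype(int)

cor = df.corr().abs()

# Supervised filter: keep features whose |correlation| with the target
# exceeds a threshold (0.1-0.3 were the values tried here)
target_cor = cor['diagnosis'].drop('diagnosis')
filtered = target_cor[target_cor > 0.2].index.tolist()

# Unsupervised filter: among the remaining features, drop one of each pair
# whose pairwise |correlation| exceeds a (separate, assumed) threshold
redundant = set()
for i, f1 in enumerate(filtered):
    if f1 in redundant:
        continue
    for f2 in filtered[i + 1:]:
        if cor.loc[f1, f2] > 0.9:
            redundant.add(f2)
reduced = [f for f in filtered if f not in redundant]

def evaluate(features):
    X, y = df[features], df['diagnosis']
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

all_features = [c for c in df.columns if c != 'diagnosis']
for name, feats in [('all', all_features), ('filtered', filtered), ('reduced', reduced)]:
    print(name, len(feats), round(evaluate(feats), 4))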

Here are the results on the test set (random 80-20 stratified split of the initial dataset):

There were 30 relevant features on which the first model was trained (first row).

Only the features highly correlated with the target were kept, and I trained another Random Forest. The threshold that worked best was 0.1 or 0.2. When I tried 0.3, I was back at the baseline model's performance. There are slight improvements in Accuracy, ROC and Precision.

I removed another 4 features due to redundancy (from each group of highly correlated features, I kept only one). It turns out this leads to a model with the same performance as the second one, but which uses fewer resources.

scikit-learn also provides the SelectKBest method, which automatically selects the k best features. The available underlying statistical tests on which to base this selection are:

  • regression: f_regression, mutual_info_regression

  • classification: chi2, f_classif, mutual_info_classif
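
A minimal sketch of SelectKBest with f_classif (k=10 is an arbitrary choice for illustration, again on scikit-learn's built-in copy of the data):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)

print(X.shape, '->', X_new.shape)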
