Feature transforms

All Data Scientists are probably used to doing feature engineering in a Jupyter Notebook, as part of getting to know the data and trying different features for the best model prediction.

Something you only realize when you step outside of the Notebook is that you need to perform the exact same processing steps at serving time.

Feature engineering

For example, if during training one of the engineered features is produced through a computation that included the standard deviation of another feature, this has the implication that:
– we either have to save that global value so that it is available at inference time
– or we perform a whole pass through the data again at inference time

Either way, we need to produce that global value somehow at serving time.
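As a minimal sketch of the first option (plain Python with NumPy; the file name and values are made up for illustration), the global statistics are saved at training time and simply loaded and applied at serving time:

```python
import json
import numpy as np

# --- Training time: compute the global statistics over the full training set ---
train_values = np.array([12.0, 15.5, 9.8, 20.1, 14.3])  # placeholder training column
stats = {"mean": float(train_values.mean()), "std": float(train_values.std())}

# Persist the statistics next to the model artifacts (hypothetical file name)
with open("feature_stats.json", "w") as f:
    json.dump(stats, f)

# --- Serving time: load the saved statistics and apply them to a single request ---
with open("feature_stats.json") as f:
    stats = json.load(f)

def standardize(value: float) -> float:
    # Uses the training-time mean/std; no pass over the data is needed here.
    return (value - stats["mean"]) / stats["std"]

print(standardize(13.7))
```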

Something that comes back to mind in this context: I remember working on a text dataset and making some time-consuming calls to text processing libraries. I would wait a few minutes for the whole dataset to be processed this way, just to extract the language or the country of the users from that dataset.

Back then I only thought about the annoyance of having to wait that long before my script proceeded to the next section, not realizing this would happen at inference time too. True, at inference time I will have a single request to predict for, not a whole dataset to process. But still, I did not truly appreciate the downside of a time-expensive preprocessing call, and I did not work out how much it would add to the total inference time.

Some feature transformations cannot be avoided. For example, we know that models converge faster and more reliably when numerical data has been normalized. We also know that dimensionality reduction helps reduce computing resources and enhances the predictive quality of the data.

Machine learning in production: feature engineering

Feature engineering at scale in production

Ideally, the same code should be used in both environments (training and production). For example, creating and training models with Python code in Notebooks, then deploying to a Java environment and translating all the feature engineering from the Notebook into Java, is far from ideal and is asking for trouble.

The advisable way to do it is to use a pipeline: a unified framework for both training and deployment.
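For illustration, here is a minimal sketch of that idea using scikit-learn and joblib (neither is prescribed above; the data is synthetic): the feature transforms and the model are bundled into one pipeline object, so serving replays exactly the transformations learned during training.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Training: the scaler and PCA learn their statistics as part of the pipeline fit
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=5)),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)
joblib.dump(pipeline, "model_pipeline.joblib")  # hypothetical artifact name

# Serving: loading the pipeline replays the exact same transformations
served = joblib.load("model_pipeline.joblib")
print(served.predict(X[:1]))
```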

What to pay attention to when preprocessing at scale:

1. inconsistencies in transformations between training and serving

These are hard to detect after they have crept into the system.

The usual sources of trouble:
– training & serving code are different (like in the Python vs Java example above)
– there are different deployment scenarios in the system: Mobile / Server / Web browser

Inconsistencies => training-serving skew => lowered model performance that goes unnoticed (if we’re unlucky).

If we’re lucky, the model will give results that are completely off and it will be obvious that there’s a problem.

2. granularity

Some transformations can be done at instance (sample) level, while others need a full pass through the data (as discussed in previous sections). Standardizing, bucketizing and min-max scaling need a full pass. Expanding features can usually be done at instance level.

At serving time we can only do instance-level transformations.

It’s also common to perform transformations per batch. This is especially popular when the training dataset is huge. Normalizing per batch is completely acceptable, even though there are differences from batch to batch; as long as we keep that in mind, we should be safe.
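A small NumPy sketch of per-batch normalization (synthetic data, hypothetical batch size) makes those batch-to-batch differences visible:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=10_000)

batch_size = 1_000  # hypothetical batch size
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]
    # Each batch is normalized with its own mean/std, so the statistics
    # differ slightly from batch to batch (and from the full-dataset values).
    normalized = (batch - batch.mean()) / batch.std()
    print(f"batch {i // batch_size}: mean={batch.mean():.2f}, std={batch.std():.2f}")
```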

3. optimizing the transformations for one sample

I touched upon this point in a previous section, when I mentioned the time-consuming processing I was doing on text data.

This is relevant for both training and serving time. It may seem an obvious consideration for serving, but it is a bit surprising for the training context.

Here's why it’s important. If we’re training on an expensive machine in the cloud, the transformations are usually performed by the CPU (cheap), while the heavy computations are done by the GPU / TPU (aka the accelerators, which are the expensive hardware parts). While the CPU transforms, the accelerators will be idle (unused, but still paid for). To counteract this downside, some frameworks introduced methods to prefetch the next batch (for the CPU to work on it, while the GPU is still working on the current batch). Something to keep in mind.
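As a sketch of that idea with tf.data (assuming TensorFlow 2.x; the dataset here is a random placeholder), mapping the transformation on the CPU in parallel and prefetching helps keep the accelerator fed:

```python
import tensorflow as tf

# Placeholder dataset; in practice this would come from TFRecords or similar.
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([10_000, 32]))

def transform(features):
    # CPU-side per-element feature transformation.
    return (features - tf.reduce_mean(features)) / (tf.math.reduce_std(features) + 1e-8)

dataset = (
    dataset
    .map(transform, num_parallel_calls=tf.data.AUTOTUNE)  # parallelize the CPU work
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)  # prepare upcoming batches while the accelerator trains
)
```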

If the training dataset is huge, it’s good to start with a small part of the data, work out the issues and slowly progress to the whole dataset.
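For example (using pandas and a hypothetical file name), loading only a slice while iterating on the transformations:

```python
import pandas as pd

# Iterate on a small sample first; switch to the full file only once the
# transformation code runs cleanly end to end.
sample = pd.read_csv("training_data.csv", nrows=10_000)  # hypothetical path
```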

There are frameworks for large-scale data processing that also help with the considerations above.

Frameworks for large-scale data processing

I’m starting my exploration of this space with TFX. I aim to be adding more tools as I continue to discover them.

TensorFlow Extended (TFX)

TFX is developed by Google and integrates, of course, with the Google Cloud Platform (or GCP).

Here’s what happens in the diagram above:
– ExampleGen splits the training data into train and eval sets
– StatisticsGen computes stats on the dataset (distribution type, std, mean, max for each feature, etc.)
– SchemaGen automatically creates a schema for our data (e.g. which feature is of what type)
– ExampleValidator detects if some of our data violates the schema (e.g. we expect an int, but got a float)
– Transform performs the feature engineering
– the Trainer trains a model
– the Evaluator assesses the model’s performance
– and the Pusher deploys it to the environment we chose.
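A minimal sketch of how these components are typically wired together with TFX’s Python API (based on the TFX 1.x tutorials; `data_root` and `module_file` are hypothetical paths, and exact argument names can vary between TFX versions):

```python
from tfx import v1 as tfx

data_root = "gs://my-bucket/training-data"    # hypothetical input location
module_file = "preprocessing_and_trainer.py"  # hypothetical user-provided code

example_gen = tfx.components.CsvExampleGen(input_base=data_root)
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"],
)
transform = tfx.components.Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=module_file,
)
trainer = tfx.components.Trainer(
    module_file=module_file,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    schema=schema_gen.outputs["schema"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)
# The Evaluator and Pusher components are wired the same way,
# consuming trainer.outputs["model"] downstream.
```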

The schema generated by SchemaGen should be reviewed / improved by a human.

The Transform component performs the operations it is programmed to do, again, by an ML Engineer.

The Transform module produces both the transformed data and something called a Transform Graph. The graph expresses the transformations we apply to our data, but as a TensorFlow graph.

Going deeper into tf.Transform

This is a more detailed schematic of the most important processing steps that showed up in the previous image.


An important thing to note: tf.Transform generates a graph that will be applied at inference time; there is no need for extra code for that. The ML Engineer provides the transformation code to be applied to the training data, and the Transform module generates the graph that will be used later for predictions.

What this ensures is that the same transformations are applied at training time and at serving time, irrespective of the deployment platform.

Transform uses something called Analyzers. What an Analyzer can do:
– scaling
– buckets (aka bucketizing)
– text processing (bag of words, tf-idf)
– dimensionality reduction (PCA)
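For example, a `preprocessing_fn` sketch using a few tf.Transform analyzers (the feature names are made up; exact analyzer availability depends on the tensorflow_transform version):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Transformations defined once; tf.Transform turns them into a graph
    that is applied identically at training time and at serving time."""
    return {
        # Scaling: a full-pass analyzer computes the mean and variance.
        "age_scaled": tft.scale_to_z_score(inputs["age"]),
        # Bucketizing: the bucket boundaries come from a full pass over the data.
        "income_bucket": tft.bucketize(inputs["income"], num_buckets=10),
        # Categorical / text processing: builds a vocabulary with an analyzer.
        "country_id": tft.compute_and_apply_vocabulary(inputs["country"]),
    }
```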

Summing up what tf.Transform offers:
– pre-process input data and create features
– define pipelines for large-scale data pre-processing

Other MLOps platforms to manage Machine Learning lifecycle

– Amazon SageMaker
– Azure Machine Learning
– Google Cloud AI Platform
