Mihaela Grigore

Data journey

How to think about data flowing through a pipeline and what is produced along the way, illustrated with TensorFlow Extended (TFX).


Last updated 3 years ago

We can think of ML in production as a data journey: data comes from a source, gets transformed (feature engineering, transforms), is used to train a model, and finally turns into predictions.

ML pipeline: Scoping --> Data --> Modeling --> Deployment

At each stage, artefacts are produced (data and associated objects such as schemas, models and metrics), along with metadata (roughly the equivalent of logging in software development).

This chain of transformations is also referred to as data provenance or lineage.

Benefits of tracking these transformations:

  • debugging

  • explainability

  • compliance with regulations

  • data versioning: version control of datasets (just like GitHub helps with code versioning and Terraform with environment versioning)

Tools for data versioning are still young, but DVC and Git LFS are gaining traction.
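
To make the idea of data versioning a bit more concrete, here is a minimal sketch that assumes a dataset version is simply a hash of the file contents. The function name `dataset_version` and the sample rows are hypothetical, and this is not how DVC or Git LFS are actually operated; it only illustrates why a changed dataset ends up with a different identifier.

```python
import hashlib


def dataset_version(data: bytes) -> str:
    """Return a short, content-derived version id for a dataset snapshot."""
    return hashlib.sha256(data).hexdigest()[:12]


# Two hypothetical snapshots of the same CSV file.
snapshot_v1 = b"id,label\n1,cat\n2,dog\n"
snapshot_v2 = b"id,label\n1,cat\n2,dog\n3,bird\n"

version_1 = dataset_version(snapshot_v1)
version_2 = dataset_version(snapshot_v2)

# Identical bytes always map to the same id, so a snapshot can be re-identified.
print(version_1 == dataset_version(snapshot_v1))
# The added row changes the content, so the second snapshot gets its own id.
print(version_1 != version_2)
```

Storing such an id next to a trained model is one simple way to record which data the model was trained on.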

I'll look at the data journey through the TensorFlow Extended (TFX) ML pipeline now.

The role of metadata

Metadata helps with tracking changes and with later debugging, much as logging does in software development.

Back to the TFX data pipeline architecture from the Feature transforms section.

The executor does the work that the respective component is supposed to do.

The driver brings in the metadata from the previous component.

The publisher stores the metadata produced by the current component.

TFX uses the ML Metadata (MLMD) library. The library can also be used outside of an ML pipeline. The advantage of using it inside a pipeline is that one doesn't have to care about fetching and storing metadata: the driver and the publisher take care of that on their own.
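
The driver / executor / publisher split described above can be sketched as a toy example. This is not TFX code: the dictionary-based `store`, the `run_component` helper and the component names are all hypothetical, chosen only to show the order in which the three roles act.

```python
from typing import Callable, Dict

# Hypothetical in-memory stand-in for a metadata store:
# component name -> metadata published by that component.
MetadataStore = Dict[str, dict]


def run_component(
    name: str,
    upstream: str,
    executor: Callable[[dict], dict],
    store: MetadataStore,
) -> dict:
    """Run one component in three steps: driver, executor, publisher."""
    # Driver: fetch the metadata published by the previous component.
    inputs = store.get(upstream, {})
    # Executor: do the component's actual work.
    outputs = executor(inputs)
    # Publisher: store the metadata produced by this component.
    store[name] = outputs
    return outputs


def compute_statistics(inputs: dict) -> dict:
    """Toy executor that summarises the rows handed over by the driver."""
    rows = inputs["rows"]
    return {"num_rows": len(rows), "max_value": max(rows)}


store: MetadataStore = {"ingest": {"rows": [3, 1, 4, 1, 5]}}
stats = run_component("statistics", "ingest", compute_statistics, store)
print(stats)  # {'num_rows': 5, 'max_value': 5}
```

The executor function never touches the store itself, which mirrors the point that a component author doesn't have to care about getting and storing metadata.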

ML Metadata terminology

Units

  • artifact: a unit of data (input or output of a component)

  • execution: a record of running a component of the pipeline, plus its associated runtime parameters

  • context: a conceptual grouping of artifacts and executions in a workflow (e.g. an experiment or a pipeline run)

Types

  • ArtifactType, ExecutionType, ContextType

Relationships

  • Event: between an artifact and an execution

  • Attribution: between an artifact and a context

  • Association: between an execution and a context
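
The terminology above can be mirrored with a few plain Python dataclasses. This is a simplified toy model, not the MLMD API: the field names, the example values and the `"INPUT"` / `"OUTPUT"` strings are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Artifact:
    id: int
    type_name: str  # plays the role of an ArtifactType
    uri: str


@dataclass
class Execution:
    id: int
    type_name: str  # plays the role of an ExecutionType
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Context:
    id: int
    type_name: str  # plays the role of a ContextType
    name: str


@dataclass
class Event:  # links an artifact to an execution
    artifact_id: int
    execution_id: int
    kind: str  # "INPUT" or "OUTPUT" in this toy model


@dataclass
class Attribution:  # links an artifact to a context
    artifact_id: int
    context_id: int


@dataclass
class Association:  # links an execution to a context
    execution_id: int
    context_id: int


# A hypothetical training run: one dataset in, one model out, one experiment.
dataset = Artifact(1, "DataSet", "path/to/data")
model = Artifact(2, "SavedModel", "path/to/model")
train_run = Execution(1, "Trainer", {"epochs": "10"})
experiment = Context(1, "Experiment", "exp-1")

events = [
    Event(dataset.id, train_run.id, "INPUT"),
    Event(model.id, train_run.id, "OUTPUT"),
]
attributions = [Attribution(model.id, experiment.id)]
associations = [Association(train_run.id, experiment.id)]
```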

All the above are building blocks of a system that needs to store metadata.

By storing this metadata, one can:

  • produce directed acyclic graphs (DAGs) of component executions in a pipeline

  • check which inputs were used in an execution

  • list artifacts generated in a specific experiment (e.g. the models trained)

  • compare artifacts
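
The queries listed above can be sketched over a handful of stored records. Everything here is hypothetical (the artifact, execution and experiment names, the tuple layout and the helper functions) and none of it is the MLMD API; the point is only that simple event and attribution records are enough to answer lineage questions.

```python
from collections import defaultdict

# Hypothetical records: (artifact, execution, direction) events and
# (artifact, experiment) attributions.
events = [
    ("raw_data", "transform", "INPUT"),
    ("features", "transform", "OUTPUT"),
    ("features", "train", "INPUT"),
    ("model", "train", "OUTPUT"),
]
attributions = [("features", "exp-1"), ("model", "exp-1")]


def inputs_of(execution: str) -> list:
    """Which artifacts were used as inputs in a given execution?"""
    return [a for a, e, kind in events if e == execution and kind == "INPUT"]


def artifacts_in(experiment: str) -> list:
    """Which artifacts are attributed to a given experiment?"""
    return [a for a, ctx in attributions if ctx == experiment]


def execution_dag() -> dict:
    """Edges between executions: producer of an artifact -> its consumers."""
    producer = {a: e for a, e, kind in events if kind == "OUTPUT"}
    dag = defaultdict(list)
    for artifact, execution, kind in events:
        if kind == "INPUT" and artifact in producer:
            dag[producer[artifact]].append(execution)
    return dict(dag)


print(inputs_of("train"))     # ['features']
print(artifacts_in("exp-1"))  # ['features', 'model']
print(execution_dag())        # {'transform': ['train']}
```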

Each of the orange components has this structure

A more complete picture of the metadata world as it is implemented in TFX (reproduced from the official TensorFlow website), where all the nomenclature from the previous paragraphs can be seen 'in action':

Figure: A high-level overview of the various components that are part of MLMD