Data journey

How to think about data flowing through a pipeline and what is produced along the way. Exemplified with TensorFlow Extended (TFX).

We can think of an ML system in production as a data journey: data comes from a source, it gets transformed (feature engineering, transforms), it is used to train a model, and it ends up as predictions.

ML pipeline: Scoping --> Data --> Modeling --> Deployment

At each stage, artifacts are produced (data and associated objects, like schemas, models and metrics), as well as metadata (roughly the equivalent of logging in software development).

This chain of transformations is also referred to as data provenance or lineage.

Benefits of tracking these transformations:

  • debugging

  • explainability

  • compliance with regulations

  • data versioning: version control of datasets (just like GitHub helps with code versioning and Terraform with environment versioning)

Tools for data versioning are still young, but DVC and Git-LFS are gaining traction.
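As a minimal sketch of what data versioning looks like with DVC's Python API (the repo URL, file path, and revision tag below are made-up placeholders, not from the original notes):

```python
import dvc.api

# Open the version of the dataset that was committed under Git tag 'v1.0'
# in a DVC-tracked repository. Repo URL, path, and rev are placeholders.
with dvc.api.open(
        'data/train.csv',
        repo='https://github.com/example/project',
        rev='v1.0') as f:
    header = f.readline()
```

Switching `rev` to a different tag or commit retrieves exactly the dataset version used at that point in the project's history.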

I'll look at the data journey through the TensorFlow Extended (TFX) ML pipeline now.

The role of metadata

It helps with tracking changes and later debugging, much like logging does in software development.

Back to the TFX data pipeline architecture from the Feature transforms section.

The executor does the work that the respective component is supposed to do.

The driver brings in the metadata from previous components.

The publisher stores the metadata produced by the current component.

TFX uses the ML Metadata (MLMD) library. The library can also be used outside of an ML pipeline. The advantage of using it inside a pipeline is that one doesn't have to handle fetching and storing metadata: the driver and the publisher do that work on their own.
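As a sketch of how little the pipeline author has to do, here is the pattern from the TFX "simple pipeline" tutorials: point the pipeline at a SQLite-backed metadata store, and the drivers and publishers record metadata automatically. The pipeline name, paths, and the single CsvExampleGen component are placeholder choices, not anything prescribed by the original notes.

```python
from tfx import v1 as tfx

# Placeholder names and paths; adjust to your project layout.
PIPELINE_NAME = 'data-journey-demo'
PIPELINE_ROOT = 'pipelines/data-journey-demo'
METADATA_PATH = 'metadata/data-journey-demo/metadata.db'
DATA_ROOT = 'data'  # directory containing the input CSV files

# A single-component pipeline: ExampleGen ingests the raw data and
# emits an Examples artifact, which gets recorded in the metadata store.
example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)

pipeline = tfx.dsl.Pipeline(
    pipeline_name=PIPELINE_NAME,
    pipeline_root=PIPELINE_ROOT,
    # SQLite-backed ML Metadata store; the components' drivers and
    # publishers read from and write to it without extra code.
    metadata_connection_config=(
        tfx.orchestration.metadata.sqlite_metadata_connection_config(
            METADATA_PATH)),
    components=[example_gen])

tfx.orchestration.LocalDagRunner().run(pipeline)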

ML Metadata terminology

Units:

  • artifact - a unit of data (the input or output of a component)

  • execution - a record of running a component of the pipeline, together with its associated runtime parameters

  • context - a conceptual grouping of artifacts and executions for one type of component

Types: ArtifactType, ExecutionType, ContextType

Relationships (between):

  • Event (artifact and execution)

  • Attribution (artifact and context)

  • Association (execution and context)
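To make these building blocks concrete, here is a sketch of using the ml-metadata library directly, following the pattern of the MLMD getting-started guide. The type names ('DataSet', 'Trainer', 'Experiment'), the URI, and the 'split' property are invented examples, not values from the notes.

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# In-memory store for experimentation; a real pipeline would point
# at a SQLite file or a MySQL database instead.
config = metadata_store_pb2.ConnectionConfig()
config.fake_database.SetInParent()
store = metadata_store.MetadataStore(config)

# Types: register an ArtifactType and an ExecutionType.
dataset_type = metadata_store_pb2.ArtifactType()
dataset_type.name = 'DataSet'
dataset_type.properties['split'] = metadata_store_pb2.STRING
dataset_type_id = store.put_artifact_type(dataset_type)

trainer_type = metadata_store_pb2.ExecutionType()
trainer_type.name = 'Trainer'
trainer_type_id = store.put_execution_type(trainer_type)

# Units: an artifact (unit of data) and an execution (a component run).
dataset = metadata_store_pb2.Artifact()
dataset.type_id = dataset_type_id
dataset.uri = 'path/to/train_data'
dataset.properties['split'].string_value = 'train'
[dataset_id] = store.put_artifacts([dataset])

training_run = metadata_store_pb2.Execution()
training_run.type_id = trainer_type_id
[run_id] = store.put_executions([training_run])

# Relationship: an Event links the artifact to the execution as its input.
event = metadata_store_pb2.Event()
event.artifact_id = dataset_id
event.execution_id = run_id
event.type = metadata_store_pb2.Event.DECLARED_INPUT
store.put_events([event])

# Context: group the artifact and execution under one experiment,
# via an Attribution (artifact-context) and an Association
# (execution-context).
experiment_type = metadata_store_pb2.ContextType()
experiment_type.name = 'Experiment'
experiment_type_id = store.put_context_type(experiment_type)

experiment = metadata_store_pb2.Context()
experiment.type_id = experiment_type_id
experiment.name = 'exp-1'
[experiment_id] = store.put_contexts([experiment])

attribution = metadata_store_pb2.Attribution()
attribution.artifact_id = dataset_id
attribution.context_id = experiment_id

association = metadata_store_pb2.Association()
association.execution_id = run_id
association.context_id = experiment_id

store.put_attributions_and_associations([attribution], [association])
```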

All the above are building blocks of a system that needs to store metadata.

By storing this metadata, one can (see the query sketch after this list):

  • produce directed acyclic graphs (DAGs) of component executions in a pipeline

  • check which inputs were used in an execution

  • list artifacts generated in a specific experiment (e.g. the models trained)

  • compare artifacts
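Continuing the sketch above (reusing its `store`, `run_id`, and the 'Experiment' context; still illustrative, not a definitive recipe), these queries map onto that list:

```python
# Which inputs were used in an execution?
events = store.get_events_by_execution_ids([run_id])
input_ids = [e.artifact_id for e in events
             if e.type == metadata_store_pb2.Event.DECLARED_INPUT]
inputs = store.get_artifacts_by_id(input_ids)

# Which artifacts were generated in a specific experiment?
[experiment] = store.get_contexts_by_type('Experiment')
experiment_artifacts = store.get_artifacts_by_context(experiment.id)

# Compare artifacts, e.g. by their URIs and recorded properties.
for artifact in experiment_artifacts:
    print(artifact.uri, dict(artifact.properties))
```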
