ML pipelines are great at automating end-to-end ML workflows, but what if you want to go one step further and automate the execution of your pipeline? In this post I’ll show you how to do exactly that. You’ll learn how to trigger your Vertex Pipelines runs in response to data added to a BigQuery table.
What are ML pipelines? A quick refresher
If the term ML pipeline is throwing you for a loop, you’re not alone! Let’s first understand what that means and the tools we’ll be using to implement it. ML pipelines are part of the larger practice of MLOps, which is concerned with productionizing ML workflows in a reproducible, reliable way.
When you’re building out an ML system and have established steps for gathering and preprocessing data, and model training, deployment, and evaluation, you might start by building out these steps as ad-hoc, disparate processes. You may want to share the workflow you’ve developed with another team and ensure they get the same results as you when they go through the steps. This will be tricky if your ML steps aren’t connected, and that’s where pipelines can help. With pipelines, you define your ML workflow as a series of steps or components. Each step in a pipeline is embodied by a container, and the output of each step will be fed as input to the next step.
How do you build a pipeline? There are open source libraries that do a lot of this heavy lifting by providing tooling for expressing and connecting pipeline steps and converting them to containers. Here I’ll be using Vertex Pipelines, a serverless tool for building, monitoring, and running ML pipelines. The best part? It supports pipelines built with two popular open source frameworks: Kubeflow Pipelines (which I’ll use here) and Tensorflow Extended (TFX).
Compiling Vertex Pipelines with the Kubeflow Pipelines SDK
This post assumes you’ve already defined a pipeline that you’d like to automate. Let’s imagine you’ve done this using the Kubeflow Pipelines SDK. Once you’ve defined your pipeline, the next step is to compile it. This will generate a JSON file with your pipeline definition that you’ll use when running the pipeline.