Machine Learning Pipelines with Google Cloud Platform (Part 2)


In my previous post, I showed how to create a pipeline on GCP using Vertex AI. The title was “Machine Learning Pipelines with Google Cloud Platform”, but I’ve gotta admit, I lied. There was no machine learning in it; it was merely an addition and a multiplication, and you probably noticed that too. But it served a purpose: you now know the gears turning behind Vertex AI, its mindset and its working principles.

In this post, I am going to show you the path that goes from data ingestion to model evaluation. But worry not (or do worry); there is much more in an MLOps engineer’s job description. Spoiler alert: continuous training and continuous deployment.

Note: Since this is the second part, I assume you have created a billing account and a project and enabled the necessary APIs on GCP. If you have any problems, feel free to contact me on any platform. Enough talking, let’s get into coding. As Linus Torvalds once said:

Talk is cheap. Show me the code.

Import necessary libraries.


Keep an eye out for the Artifact, Dataset, Input, and similar imports! They are what let you pass artifacts between Kubeflow components.

Then set up the GCP environment the same way as before.
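For reference, that setup boils down to a handful of constants. This is a minimal sketch; the names (`PROJECT_ID`, `REGION`, `BUCKET_NAME`, `PIPELINE_ROOT`) and all values are placeholders I made up for illustration, so swap in your own:

```python
# Placeholder values: replace them with your own project, region, and bucket.
PROJECT_ID = "my-gcp-project"                  # hypothetical project ID
REGION = "us-central1"                         # hypothetical region
BUCKET_NAME = f"gs://{PROJECT_ID}-pipelines"   # hypothetical Cloud Storage bucket

# Folder in the bucket where pipeline runs store their artifacts.
PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root"

print(PIPELINE_ROOT)
```

Keeping these in one place means every later step (components, pipeline definition, scheduling) reads the same values.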

Next, instantiate the Kubeflow Pipelines API client:

Wait a second, where did we get that host link? I asked the same question when I was reading the official Google docs. Then I ran to my colleague, and he told me to deploy Kubeflow Pipelines from the Marketplace. Once it is deployed, it gives you a link to the Kubeflow Pipelines dashboard.

If you ever feel lost after deploying Kubeflow Pipelines, you can head back to AI Platform Pipelines and get the link by clicking “Open Pipelines Dashboard”. The link is in your browser's address bar.
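As a rough sketch of that step: the address bar usually shows the dashboard URL with a trailing `/#/pipelines` part, while the client only needs the scheme and host. The URL below is a made-up example, and `host_from_dashboard_url` and `make_client` are hypothetical helper names; `kfp.Client(host=...)` is the actual SDK call.

```python
from urllib.parse import urlsplit


def host_from_dashboard_url(url: str) -> str:
    """Keep only scheme and host, dropping the path and the '#/pipelines' fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def make_client(dashboard_url: str):
    """Create a Kubeflow Pipelines client for the given dashboard URL."""
    import kfp  # imported lazily so the URL helper works without kfp installed

    return kfp.Client(host=host_from_dashboard_url(dashboard_url))


# Made-up example of what the address bar might show.
DASHBOARD_URL = "https://1234abcd-dot-us-central1.pipelines.googleusercontent.com/#/pipelines"
print(host_from_dashboard_url(DASHBOARD_URL))
```

Calling `make_client(DASHBOARD_URL)` needs the `kfp` package and valid GCP credentials, so it is left out of the snippet above.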

Moving on to fetching the data: I used the BigQuery client API, but you can use any other method. Kubeflow components are just wrappers around your Python functions.

Remember from the first post that each component runs as a container, so we need to set up its environment correctly. To do that, we list the libraries to install, much like we would in a Dockerfile. Then we define our function, declaring through its type annotations which input artifacts it takes and which output artifacts it produces. This function produces two Output[Dataset] artifacts. The other steps are pretty straightforward.

BigQuery organizes tables hierarchically (Project > Dataset > Table). Define the project, dataset, and table for the BigQuery table we want. Fetch the table and turn it into a DataFrame. Use sklearn to split the data into train and test sets.
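Put together, a component along these lines could look like the sketch below. It assumes the KFP SDK v2 import paths (`from kfp import dsl`; older 1.8 releases expose the same names under `kfp.v2.dsl`). The project, dataset, and table names, the base image, and the 80/20 split are placeholders of mine, not values from the original pipeline.

```python
from kfp import dsl
from kfp.dsl import Dataset, Output


@dsl.component(
    base_image="python:3.10",  # assumed base image
    packages_to_install=[      # installed in the container, Dockerfile-style
        "google-cloud-bigquery",
        "pandas",
        "pyarrow",
        "db-dtypes",
        "scikit-learn",
    ],
)
def get_data(dataset_train: Output[Dataset], dataset_test: Output[Dataset]):
    # Imports live inside the function: only this body runs in the container.
    from google.cloud import bigquery
    from sklearn.model_selection import train_test_split

    # Hypothetical identifiers, following Project > Dataset > Table.
    project, dataset, table = "my-gcp-project", "my_dataset", "my_table"

    # Read the whole table and convert it to a pandas DataFrame.
    client = bigquery.Client(project=project)
    df = client.list_rows(f"{project}.{dataset}.{table}").to_dataframe()

    # Assumed split ratio and seed, for illustration only.
    train, test = train_test_split(df, test_size=0.2, random_state=42)

    # Nothing is returned: each split is written to its artifact's path.
    train.to_csv(dataset_train.path, index=False)
    test.to_csv(dataset_test.path, index=False)
```

The two `Output[Dataset]` parameters are never passed by the caller; the pipeline backend creates them and hands them to the function at run time.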

Notice that the function didn't return anything. We reach the datasets through each artifact's path. Once more: components receive and produce artifacts. When calling the next component, we pass get_data().outputs["dataset_train"]; that component then accesses the dataset_train object and reads the data through its path attribute.

If you are familiar with modelling, the code is pretty trivial, except for the model_artifact.metadata part. We access the Model artifact's metadata and assign the desired values to it. In this case, I assigned the training score (R-squared) and the framework for demo purposes. You can assign other things as long as they are primitive data types like int, float, or str.
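To make the metadata part concrete, here is a minimal sketch of such a training component, again assuming KFP SDK v2 imports. The component name `train_model`, the `target` column, the choice of `LinearRegression`, and pickling the model are all assumptions for illustration; only `model_artifact.metadata` and the R-squared/framework values come from the description above.

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output


@dsl.component(
    base_image="python:3.10",  # assumed base image
    packages_to_install=["pandas", "scikit-learn"],
)
def train_model(dataset_train: Input[Dataset], model_artifact: Output[Model]):
    import pickle

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    target = "target"  # hypothetical label column

    # Read the training split through the input artifact's path.
    data = pd.read_csv(dataset_train.path)
    features, labels = data.drop(columns=[target]), data[target]

    model = LinearRegression().fit(features, labels)

    # Metadata values must be primitives, hence the explicit float().
    model_artifact.metadata["train_score"] = float(model.score(features, labels))
    model_artifact.metadata["framework"] = "scikit-learn"

    # Persist the fitted model at the output artifact's path.
    with open(model_artifact.path, "wb") as f:
        pickle.dump(model, f)
```

For a regressor, `model.score` returns R-squared, which is why it can be stored directly as the training score.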

This component takes two inputs, a Dataset and a Model, and returns Metrics. We access the model's metadata and assign our score to it, so we can observe it from the pipelines UI. Then we run the pipeline like we always do:
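As a sketch of how the pieces connect, the snippet below wires three components into a pipeline and compiles it, assuming KFP SDK v2. To stay self-contained, the component bodies are deliberately tiny stubs standing in for the real fetch, train, and evaluate logic; the names `train_model`, `evaluate_model`, the pipeline name, and the output file name are hypothetical.

```python
from kfp import compiler, dsl
from kfp.dsl import Dataset, Input, Metrics, Model, Output


@dsl.component(base_image="python:3.10")
def get_data(dataset_train: Output[Dataset], dataset_test: Output[Dataset]):
    # Stub: the real component writes the train/test CSVs to these paths.
    for artifact in (dataset_train, dataset_test):
        with open(artifact.path, "w") as f:
            f.write("feature,target\n")


@dsl.component(base_image="python:3.10")
def train_model(dataset_train: Input[Dataset], model_artifact: Output[Model]):
    # Stub: the real component fits a model and saves it to model_artifact.path.
    model_artifact.metadata["framework"] = "stub"


@dsl.component(base_image="python:3.10")
def evaluate_model(
    dataset_test: Input[Dataset],
    model_artifact: Input[Model],
    metrics: Output[Metrics],
):
    # Stub: the real component scores the model on the test split.
    metrics.log_metric("r2", 0.0)


@dsl.pipeline(name="train-and-evaluate")
def pipeline():
    data = get_data()
    # Artifacts travel between steps via .outputs["<parameter name>"].
    trained = train_model(dataset_train=data.outputs["dataset_train"])
    evaluate_model(
        dataset_test=data.outputs["dataset_test"],
        model_artifact=trained.outputs["model_artifact"],
    )


# Compile the pipeline definition into a file that can be submitted.
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")
```

The compiled file can then be submitted to Vertex AI Pipelines, for example through `aiplatform.PipelineJob` from the `google-cloud-aiplatform` package, with `template_path` pointing at it.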

Head to Vertex AI Pipelines and observe your pipeline's status.

Vertex AI Pipelines status

You can inspect your pipeline after it says “success”. You can see run durations, metrics, artifacts, and even logs, so you can easily spot bottlenecks and potential problems.

If you look at the times my pipeline runs, you'll see a pattern: it runs every day at 7:30. This way, the model sees new data and learns new insights just before I start my shift. Better yet, you only need a small chunk of code:

If you are having trouble with the cron expression for the schedule, you can check crontab guru.
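For the schedule described here, every day at 7:30, the matching cron expression is `30 7 * * *`. The small helper below (`describe_cron` is a hypothetical name, not part of any SDK) just labels the five fields so the expression is easier to read:

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Label the five fields of a standard cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


# Minute 30 of hour 7, on every day of every month, any day of the week.
DAILY_AT_0730 = "30 7 * * *"
print(describe_cron(DAILY_AT_0730))
```

Keep in mind that the time is interpreted in whatever time zone the scheduler is configured with, so check that setting when you create the schedule.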

And that's it, folks: we have fetched data, trained a model, evaluated it, and scheduled our pipeline to run automatically every morning, right before we get our first cup of coffee. In the next post, we will work on deploying our newly trained model on Google Cloud Platform with Vertex AI.

AI Engineer / Master’s student in Data Science