Getting Started with ZenML: Building Production-Ready ML Pipelines
TL;DR
I’ve created a comprehensive guide about ZenML that covers everything from basic setup to advanced features and best practices. The guide includes:
- A thorough introduction to ZenML and its benefits
- Step-by-step setup instructions
- A complete example pipeline using the iris dataset
- Detailed explanations of key concepts
- Advanced features like stack components and CI/CD integration
- Best practices and monitoring tips
Machine learning pipelines are complex beasts. They involve multiple steps — from data ingestion to preprocessing, training, and deployment — each with their own dependencies and requirements. Enter ZenML, an open-source MLOps framework designed to make building and managing ML pipelines easier and more reproducible.
What is ZenML?
ZenML is a framework that helps data scientists and ML engineers transform their ML workflows into production-ready pipelines. It provides a clean, Python-first API that makes it easy to define, run, and track your ML experiments while following MLOps best practices.
Think of ZenML as the glue that connects different pieces of your ML infrastructure. It handles everything from pipeline versioning to artifact management, making your ML workflows more organized and reproducible.
Why Choose ZenML?
Before diving into the technical details, let’s understand why ZenML might be the right choice for your ML projects:
- Framework Agnostic: Works seamlessly with popular ML frameworks like TensorFlow, PyTorch, or scikit-learn.
- Cloud Native: Designed to run anywhere — locally, on-premise, or in the cloud.
- Extensible: Offers a plugin system that lets you integrate with your existing ML tools and services.
- Reproducible: Automatically tracks code, data, and model artifacts for each pipeline run.
Setting Up Your ZenML Environment
Let’s start by setting up ZenML in your development environment. First, you’ll need Python 3.8 or later installed (check the ZenML documentation for the currently supported versions).
# Create and activate a virtual environment
python -m venv zenml-env
source zenml-env/bin/activate # On Windows, use `zenml-env\Scripts\activate`
# Install ZenML
pip install zenml
# Initialize ZenML
zenml init
The `zenml init` command creates a `.zen` directory in your project, which stores your pipeline configurations and local database.
Creating Your First Pipeline
Let’s create a simple ML pipeline that demonstrates ZenML’s core concepts. We’ll build a pipeline that:
- Loads data
- Preprocesses it
- Trains a model
- Evaluates the results
Here’s how to implement this:
from typing import Tuple

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from zenml import pipeline, step

@step
def load_data() -> pd.DataFrame:
    """Load the dataset."""
    # For this example, we'll use the iris dataset
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    return df

@step
def preprocess_data(
    df: pd.DataFrame,
) -> Tuple[np.ndarray, np.ndarray, pd.Series, pd.Series]:
    """Split and preprocess the data."""
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    scaler = StandardScaler()
    # fit_transform/transform return NumPy arrays, hence the annotations above
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test

@step
def train_model(
    X_train: np.ndarray,
    y_train: pd.Series,
) -> RandomForestClassifier:
    """Train the model."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model

@step
def evaluate_model(
    model: RandomForestClassifier,
    X_test: np.ndarray,
    y_test: pd.Series,
) -> float:
    """Evaluate the model and return the accuracy."""
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f"Model accuracy: {accuracy:.2f}")
    return accuracy

@pipeline
def training_pipeline():
    """Define the pipeline steps."""
    df = load_data()
    X_train, X_test, y_train, y_test = preprocess_data(df)
    model = train_model(X_train, y_train)
    evaluate_model(model, X_test, y_test)
To run the pipeline:
if __name__ == "__main__":
    training_pipeline()
Understanding Key ZenML Concepts
Let’s break down the important concepts we just used:
Steps
Steps are the building blocks of your pipeline. Each step is a Python function decorated with `@step` that performs a specific task. Steps can have inputs and outputs, and ZenML automatically tracks these artifacts.
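As a minimal illustration (separate from the pipeline above), ZenML also lets you name a step’s output artifact explicitly via `Annotated`; a sketch, assuming a recent ZenML version:

from typing_extensions import Annotated
from zenml import step

@step
def add_numbers(a: int, b: int) -> Annotated[int, "sum"]:
    """The output is tracked as an artifact named 'sum'."""
    return a + b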
Pipelines
A pipeline is a collection of steps that form your ML workflow. The `@pipeline` decorator helps ZenML understand how your steps are connected and manages their execution.
Artifacts
Every output from a step becomes an artifact that ZenML automatically versions and stores. This makes it easy to track what data was used for each pipeline run.
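For example, after the pipeline above has run, you can fetch its stored artifacts programmatically; a sketch, assuming a recent ZenML version and the `training_pipeline` defined earlier:

from zenml.client import Client

# Fetch the most recent run of the pipeline defined earlier
run = Client().get_pipeline("training_pipeline").last_run

# Load the trained model artifact produced by the train_model step
model = run.steps["train_model"].output.load()
print(model)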
Advanced Features
Once you’re comfortable with basic pipelines, ZenML offers many advanced features:
Stack Components
ZenML uses “stacks” to manage your ML infrastructure. A stack consists of components like:
- Orchestrators (local, Kubeflow, Airflow)
- Artifact stores (local, S3, GCS)
- Experiment trackers (MLflow, Weights & Biases)
Here’s how to configure a custom stack:
# Install the MLflow integration first
zenml integration install mlflow -y

# Register an MLflow experiment tracker
zenml experiment-tracker register mlflow_tracker --flavor=mlflow

# Register a stack that includes it, reusing the default orchestrator and artifact store
zenml stack register my_stack -o default -a default -e mlflow_tracker

# Activate the stack
zenml stack set my_stack
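With the tracker registered, individual steps opt in to experiment tracking by name and can then use the regular MLflow API. A sketch, assuming the `mlflow_tracker` registered above:

import mlflow
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from zenml import step

@step(experiment_tracker="mlflow_tracker")
def train_model_tracked(
    X_train: np.ndarray, y_train: pd.Series
) -> RandomForestClassifier:
    """Train a model and log parameters and metrics to MLflow."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X_train, y_train))
    return model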
Pipeline Configurations
You can customize pipeline behavior using configurations:
@pipeline(enable_cache=False, name="training_pipeline_v2")
def training_pipeline():
    # Pipeline steps here
    pass
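Caching and other options can also be set per step, or overridden at run time without editing the pipeline definition. A sketch of both patterns, assuming a recent ZenML version:

from zenml import step

# Disable caching for a single step, e.g. one that fetches fresh data
@step(enable_cache=False)
def fetch_latest_data() -> list:
    """Always re-executes, even when its inputs are unchanged."""
    return [1, 2, 3]

# Override options at run time; training_pipeline is the pipeline from earlier
training_pipeline.with_options(enable_cache=False)()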
Continuous Integration
ZenML integrates well with CI/CD tools. Here’s a simple GitHub Actions workflow:
name: ZenML Pipeline
on: [push]

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
      - name: Install dependencies
        run: |
          pip install zenml
          pip install -r requirements.txt
      - name: Run pipeline
        run: python run_pipeline.py
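The workflow above assumes a `run_pipeline.py` entry point in the repository root; a minimal sketch (the import path is hypothetical and depends on your project layout):

# run_pipeline.py
from pipelines.training import training_pipeline  # hypothetical module path

if __name__ == "__main__":
    training_pipeline()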
Best Practices
When working with ZenML, keep these best practices in mind:
- Version Control: Always version your pipeline code and configurations.
- Requirements Management: Maintain a `requirements.txt` file with pinned versions, or better yet a `pyproject.toml` file.
- Documentation: Document your steps and pipelines using docstrings.
- Modularity: Keep steps focused on single tasks for better reusability.
- Error Handling: Implement proper error handling in your steps (see the sketch below).
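To illustrate the last point, here is a sketch of defensive validation inside a step; the specific checks are arbitrary assumptions for demonstration:

import pandas as pd
from zenml import step

@step
def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast with a clear message instead of passing bad data downstream."""
    if df.empty:
        raise ValueError("Input DataFrame is empty; upstream loading likely failed.")
    if df.isnull().any().any():
        raise ValueError("Input contains missing values; add an imputation step first.")
    return df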
Monitoring and Debugging
ZenML provides several ways to monitor and debug your pipelines:
# List all pipeline runs
zenml pipeline runs list
# Get details about a specific run
zenml pipeline runs get <run_id>
# View run artifacts
zenml artifact list
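The same information is also available programmatically through the Python client; a sketch, assuming a recent ZenML version:

from zenml.client import Client

# Print the five most recent pipeline runs and their statuses
runs = Client().list_pipeline_runs(size=5)
for run in runs.items:
    print(run.name, run.status)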
You can also use the ZenML dashboard for a visual interface:
zenml up
This starts a local dashboard where you can view pipeline runs, artifacts, and metrics.
Conclusion
ZenML provides a solid foundation for building production-ready ML pipelines. By following this guide, you’ve learned how to:
- Set up ZenML in your environment
- Create basic pipelines with steps
- Use advanced features like custom stacks
- Follow best practices for ML pipeline development
As you continue working with ZenML, explore its integrations with other tools in the ML ecosystem. The framework’s flexibility and extensibility make it a valuable addition to any MLOps toolkit.
Remember, the key to successful ML pipelines is not just getting them to work, but making them reproducible, maintainable, and scalable. ZenML helps achieve these goals with its structured approach to pipeline development.