Getting Started with ZenML: Building Production-Ready ML Pipelines
TL;DR
I’ve created a comprehensive guide about ZenML that covers everything from basic setup to advanced features and best practices. The guide includes:
- A thorough introduction to ZenML and its benefits
- Step-by-step setup instructions
- A complete example pipeline using the iris dataset
- Detailed explanations of key concepts
- Advanced features like stack components and CI/CD integration
- Best practices and monitoring tips
Machine learning pipelines are complex beasts. They involve multiple steps — from data ingestion to preprocessing, training, and deployment — each with their own dependencies and requirements. Enter ZenML, an open-source MLOps framework designed to make building and managing ML pipelines easier and more reproducible.
What is ZenML?
ZenML is a framework that helps data scientists and ML engineers transform their ML workflows into production-ready pipelines. It provides a clean, Python-first API that makes it easy to define, run, and track your ML experiments while following MLOps best practices.
Think of ZenML as the glue that connects different pieces of your ML infrastructure. It handles everything from pipeline versioning to artifact management, making your ML workflows more organized and reproducible.
Why Choose ZenML?
Before diving into the technical details, let’s understand why ZenML might be the right choice for your ML projects:
- Framework Agnostic: Works seamlessly with popular ML frameworks like TensorFlow, PyTorch, or scikit-learn.
- Cloud Native: Designed to run anywhere — locally, on-premise, or in the cloud.
- Extensible: Offers a plugin system that lets you integrate with your existing ML tools and services.
- Reproducible: Automatically tracks code, data, and model artifacts for each pipeline run.
Setting Up Your ZenML Environment
Let’s start by setting up ZenML in your development environment. First, you’ll need Python 3.8 or later installed (check the ZenML documentation for the currently supported versions).
# Create and activate a virtual environment
python -m venv zenml-env
source zenml-env/bin/activate # On Windows, use `zenml-env\Scripts\activate`
# Install ZenML
pip install zenml
# Initialize ZenML
zenml init
The `zenml init` command creates a `.zen` directory in your project, which stores your pipeline configurations and local database.
Creating Your First Pipeline
Let’s create a simple ML pipeline that demonstrates ZenML’s core concepts. We’ll build a pipeline that:
- Loads data
- Preprocesses it
- Trains a model
- Evaluates the results
Here’s how to implement this:
from typing import Tuple

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from zenml import pipeline, step

@step
def load_data() -> pd.DataFrame:
    """Load the dataset."""
    # For this example, we'll use the iris dataset
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    return df

@step
def preprocess_data(
    df: pd.DataFrame,
) -> Tuple[np.ndarray, np.ndarray, pd.Series, pd.Series]:
    """Split and preprocess the data."""
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    scaler = StandardScaler()
    # fit_transform/transform return NumPy arrays, hence the annotations above
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test

@step
def train_model(
    X_train: np.ndarray,
    y_train: pd.Series,
) -> RandomForestClassifier:
    """Train the model."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model

@step
def evaluate_model(
    model: RandomForestClassifier,
    X_test: np.ndarray,
    y_test: pd.Series,
) -> float:
    """Evaluate the model and return the accuracy."""
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f"Model accuracy: {accuracy:.2f}")
    return accuracy

@pipeline
def training_pipeline():
    """Define the pipeline steps."""
    df = load_data()
    X_train, X_test, y_train, y_test = preprocess_data(df)
    model = train_model(X_train, y_train)
    evaluate_model(model, X_test, y_test)
To run the pipeline:
if __name__ == "__main__":
    training_pipeline()
Understanding Key ZenML Concepts
Let’s break down the important concepts we just used:
Steps
Steps are the building blocks of your pipeline. Each step is a Python function decorated with `@step` that performs a specific task. Steps can have inputs and outputs, and ZenML automatically tracks these artifacts.
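As a minimal illustration (separate from the pipeline above), ZenML also lets you name a step’s output artifact explicitly via `Annotated`; a sketch, assuming a recent ZenML version:

from typing_extensions import Annotated
from zenml import step

@step
def add_numbers(a: int, b: int) -> Annotated[int, "sum"]:
    """The output is tracked as an artifact named 'sum'."""
    return a + b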
Pipelines
A pipeline is a collection of steps that form your ML workflow. The `@pipeline` decorator helps ZenML understand how your steps are connected and manages their execution.
Artifacts
Every output from a step becomes an artifact that ZenML automatically versions and stores. This makes it easy to track what data was used for each pipeline run.
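For example, after the pipeline above has run, you can fetch its stored artifacts programmatically; a sketch, assuming a recent ZenML version and the `training_pipeline` defined earlier:

from zenml.client import Client

# Fetch the most recent run of the pipeline defined earlier
run = Client().get_pipeline("training_pipeline").last_run

# Load the trained model artifact produced by the train_model step
model = run.steps["train_model"].output.load()
print(model)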
Advanced Features
Once you’re comfortable with basic pipelines, ZenML offers many advanced features:
Stack Components
ZenML uses “stacks” to manage your ML infrastructure. A stack consists of components like:
- Orchestrators (local, Kubeflow, Airflow)
- Artifact stores (local, S3, GCS)
- Experiment trackers (MLflow, Weights & Biases)
Here’s how to configure a custom stack:
# Install the MLflow integration first
zenml integration install mlflow -y

# Register an MLflow experiment tracker
zenml experiment-tracker register mlflow_tracker --flavor=mlflow

# Register a stack that includes it, reusing the default orchestrator and artifact store
zenml stack register my_stack -o default -a default -e mlflow_tracker

# Activate the stack
zenml stack set my_stack
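With the tracker registered, individual steps opt in to experiment tracking by name and can then use the regular MLflow API. A sketch, assuming the `mlflow_tracker` registered above:

import mlflow
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from zenml import step

@step(experiment_tracker="mlflow_tracker")
def train_model_tracked(
    X_train: np.ndarray, y_train: pd.Series
) -> RandomForestClassifier:
    """Train a model and log parameters and metrics to MLflow."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X_train, y_train))
    return model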
Pipeline Configurations
You can customize pipeline behavior using configurations:
@pipeline(enable_cache=False, name="training_pipeline_v2")
def training_pipeline():
    # Pipeline steps here
    pass
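Caching and other options can also be set per step, or overridden at run time without editing the pipeline definition. A sketch of both patterns, assuming a recent ZenML version:

from zenml import step

# Disable caching for a single step, e.g. one that fetches fresh data
@step(enable_cache=False)
def fetch_latest_data() -> list:
    """Always re-executes, even when its inputs are unchanged."""
    return [1, 2, 3]

# Override options at run time; training_pipeline is the pipeline from earlier
training_pipeline.with_options(enable_cache=False)()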
Continuous Integration
ZenML integrates well with CI/CD tools. Here’s a simple GitHub Actions workflow:
name: ZenML Pipeline
on: [push]

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
      - name: Install dependencies
        run: |
          pip install zenml
          pip install -r requirements.txt
      - name: Run pipeline
        run: python run_pipeline.py
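The workflow above assumes a `run_pipeline.py` entry point in the repository root; a minimal sketch (the import path is hypothetical and depends on your project layout):

# run_pipeline.py
from pipelines.training import training_pipeline  # hypothetical module path

if __name__ == "__main__":
    training_pipeline()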
Best Practices
When working with ZenML, keep these best practices in mind:
- Version Control: Always version your pipeline code and configurations.
- Requirements Management: Maintain a `requirements.txt` file with pinned versions, or better yet a `pyproject.toml` file.
- Documentation: Document your steps and pipelines using docstrings.
- Modularity: Keep steps focused on single tasks for better reusability.
- Error Handling: Implement proper error handling in your steps (see the sketch below).
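To illustrate the last point, here is a sketch of defensive validation inside a step; the specific checks are arbitrary assumptions for demonstration:

import pandas as pd
from zenml import step

@step
def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast with a clear message instead of passing bad data downstream."""
    if df.empty:
        raise ValueError("Input DataFrame is empty; upstream loading likely failed.")
    if df.isnull().any().any():
        raise ValueError("Input contains missing values; add an imputation step first.")
    return df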
Monitoring and Debugging
ZenML provides several ways to monitor and debug your pipelines:
# List all pipeline runs
zenml pipeline runs list
# Get details about a specific run
zenml pipeline runs get <run_id>
# View run artifacts
zenml artifact list
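The same information is also available programmatically through the Python client; a sketch, assuming a recent ZenML version:

from zenml.client import Client

# Print the five most recent pipeline runs and their statuses
runs = Client().list_pipeline_runs(size=5)
for run in runs.items:
    print(run.name, run.status)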
You can also use the ZenML dashboard for a visual interface:
zenml up
This starts a local dashboard where you can view pipeline runs, artifacts, and metrics.
Conclusion
ZenML provides a solid foundation for building production-ready ML pipelines. By following this guide, you’ve learned how to:
- Set up ZenML in your environment
- Create basic pipelines with steps
- Use advanced features like custom stacks
- Follow best practices for ML pipeline development
As you continue working with ZenML, explore its integrations with other tools in the ML ecosystem. The framework’s flexibility and extensibility make it a valuable addition to any MLOps toolkit.
Remember, the key to successful ML pipelines is not just getting them to work, but making them reproducible, maintainable, and scalable. ZenML helps achieve these goals with its structured approach to pipeline development.