CI/CD for Machine Learning

Deep Dive · 35 min read · January 24, 2026

Build a complete ML pipeline with GitHub Actions: data validation, model training, automated testing, and staged deployment to production.

ML CI/CD is different from software CI/CD. You’re not just testing code—you’re testing data, models, and the interaction between them. A model that passes unit tests can still fail in production if the data distribution shifts or performance degrades below acceptable thresholds.

This tutorial builds a complete ML pipeline with GitHub Actions that validates data, trains models, runs performance tests, and deploys to production with blue-green rollouts.

The ML Pipeline Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        ML CI/CD PIPELINE                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐       │
│  │  Code    │   │  Data    │   │  Model   │   │  Model   │       │
│  │ Quality  │──▶│Validation│──▶│ Training │──▶│ Testing  │       │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘       │
│       │              │              │              │             │
│       ▼              ▼              ▼              ▼             │
│   - Lint         - Schema       - Train        - Accuracy        │
│   - Types        - Nulls        - Log to       - Latency         │
│   - Unit tests   - Ranges         MLflow       - Memory          │
│                  - Balance                     - Regression      │
│                                                                  │
│  ┌───────────┐   ┌──────────┐   ┌──────────┐                     │
│  │Integration│──▶│ Deploy   │──▶│ Deploy   │                     │
│  │   Tests   │   │ Staging  │   │Production│                     │
│  └───────────┘   └──────────┘   └──────────┘                     │
│        │              │              │                           │
│        ▼              ▼              ▼                           │
│   - End-to-end   - Smoke       - Blue-green                      │
│   - API tests      tests       - Canary                          │
│                                - Rollback                        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Each stage gates the next. If data validation fails, training doesn’t run. If model tests fail, deployment is blocked.
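
In GitHub Actions, that gating is expressed with needs: a job only runs once every job it depends on has succeeded. A minimal sketch:

jobs:
  data-validation:
    runs-on: ubuntu-latest
    # ...validation steps...

  train:
    runs-on: [self-hosted, gpu]
    needs: data-validation   # skipped if data validation fails
    # ...training steps...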

Part 1: Data Validation

Data issues are the #1 cause of ML failures in production. Catch them early with automated validation.

The Data Validator

# scripts/data_validation.py
import json
import sys
import hashlib
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from pathlib import Path

import pandas as pd

@dataclass
class ValidationResult:
    """Result of a validation check."""
    name: str
    passed: bool
    message: str
    details: Optional[Dict[str, Any]] = None

@dataclass
class DataValidationReport:
    """Complete validation report."""
    dataset_path: str
    dataset_hash: str
    total_checks: int
    passed_checks: int
    failed_checks: int
    results: List[ValidationResult]

    @property
    def passed(self) -> bool:
        return self.failed_checks == 0

class DataValidator:
    """Validates ML datasets."""

    def validate_schema(self, df: pd.DataFrame, expected_columns: List[str]) -> ValidationResult:
        """Check that expected columns exist."""
        missing = set(expected_columns) - set(df.columns)

        if missing:
            return ValidationResult(
                name="schema_validation",
                passed=False,
                message=f"Missing columns: {missing}",
                details={"missing": list(missing)}
            )
        return ValidationResult(
            name="schema_validation",
            passed=True,
            message=f"All {len(expected_columns)} expected columns present"
        )

    def validate_no_nulls(self, df: pd.DataFrame, columns: List[str]) -> ValidationResult:
        """Check for null values in specified columns."""
        null_counts = df[columns].isnull().sum()
        columns_with_nulls = null_counts[null_counts > 0]

        if len(columns_with_nulls) > 0:
            return ValidationResult(
                name="null_check",
                passed=False,
                message=f"Found nulls in {len(columns_with_nulls)} columns",
                details={"null_counts": columns_with_nulls.to_dict()}
            )
        return ValidationResult(
            name="null_check",
            passed=True,
            message="No null values found in required columns"
        )

    def validate_class_balance(self, df: pd.DataFrame, label_column: str, max_imbalance: float = 0.8) -> ValidationResult:
        """Check that classes aren't too imbalanced."""
        class_counts = df[label_column].value_counts(normalize=True)
        max_ratio = class_counts.max()

        if max_ratio > max_imbalance:
            return ValidationResult(
                name="class_balance",
                passed=False,
                message=f"Class imbalance: {max_ratio:.2%} in majority class",
                details={"distribution": class_counts.to_dict()}
            )
        return ValidationResult(
            name="class_balance",
            passed=True,
            message=f"Class balance acceptable (max {max_ratio:.2%})"
        )

    def validate_min_samples(self, df: pd.DataFrame, min_samples: int) -> ValidationResult:
        """Check minimum sample count."""
        actual = len(df)
        if actual < min_samples:
            return ValidationResult(
                name="min_samples",
                passed=False,
                message=f"Insufficient samples: {actual} < {min_samples}"
            )
        return ValidationResult(
            name="min_samples",
            passed=True,
            message=f"Sample count OK: {actual} >= {min_samples}"
        )

Running Data Validation
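
The report below comes from a small CLI wrapper around DataValidator. Here is a minimal sketch of what that entry point might look like, assuming the argument names used by the workflow later in this tutorial (the SHA-256 hash is truncated to 16 hex characters to match the report):

# Hypothetical main() for scripts/data_validation.py -- a sketch, not
# necessarily the exact script; reuses the module-level imports above.
import argparse
from dataclasses import asdict

def main() -> None:
    parser = argparse.ArgumentParser(description="Validate an ML dataset")
    parser.add_argument("dataset")
    parser.add_argument("--columns", nargs="+", required=True)
    parser.add_argument("--label", required=True)
    parser.add_argument("--min-samples", type=int, default=500)
    parser.add_argument("--output", default="validation_report.json")
    args = parser.parse_args()

    df = pd.read_csv(args.dataset)
    # Hash the raw bytes so downstream jobs can detect unchanged data
    dataset_hash = hashlib.sha256(Path(args.dataset).read_bytes()).hexdigest()[:16]

    validator = DataValidator()
    results = [
        validator.validate_schema(df, args.columns),
        validator.validate_no_nulls(df, args.columns),
        validator.validate_min_samples(df, args.min_samples),
        validator.validate_class_balance(df, args.label),
    ]
    failed = sum(1 for r in results if not r.passed)
    report = DataValidationReport(
        dataset_path=args.dataset,
        dataset_hash=dataset_hash,
        total_checks=len(results),
        passed_checks=len(results) - failed,
        failed_checks=failed,
        results=results,
    )
    Path(args.output).write_text(json.dumps(asdict(report), indent=2))
    sys.exit(0 if report.passed else 1)

if __name__ == "__main__":
    main()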

Output
============================================================
DATA VALIDATION REPORT
============================================================
Dataset: data/train.csv
Hash: a0b908d5959bc22c
Checks: 4/4 passed

  ✓ schema_validation: All 2 expected columns present
  ✓ null_check: No null values found in required columns
  ✓ min_samples: Sample count OK: 1000 >= 500
  ✓ class_balance: Class balance acceptable (max 51.00%)

VALIDATION PASSED

The data hash (a0b908d5959bc22c) enables caching—if data hasn’t changed, we can skip re-training.
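
One way to exploit this in GitHub Actions is to key a cache on the hash, so the training job becomes a no-op when a model for this exact dataset already exists. A sketch; the cache-key scheme is an assumption, not part of the workflow shown later:

# Sketch: skip training on a cache hit keyed by the dataset hash
train:
  needs: data-validation
  runs-on: [self-hosted, gpu]
  steps:
    - uses: actions/checkout@v4

    - name: Restore cached model
      id: model-cache
      uses: actions/cache@v4
      with:
        path: outputs/model/
        key: model-${{ needs.data-validation.outputs.data_hash }}

    - name: Train model
      if: steps.model-cache.outputs.cache-hit != 'true'
      run: python src/train.py --data data/train.csv --output-dir outputs/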

Part 2: Model Testing

Model tests verify that the model meets performance requirements before deployment.

The Model Tester

# scripts/model_testing.py
import time
import torch
import numpy as np
from dataclasses import dataclass
from typing import Dict, Any, Optional, List

@dataclass
class TestResult:
    """Result of a single test."""
    name: str
    passed: bool
    message: str
    actual_value: Optional[float] = None
    threshold: Optional[float] = None

class ModelTester:
    """Tests ML models against requirements."""

    def __init__(self, model, device: str = "cuda"):
        self.model = model
        self.device = device
        self.model.to(device)
        self.model.eval()

    def test_accuracy_threshold(
        self,
        dataloader,
        min_accuracy: float
    ) -> TestResult:
        """Test that model meets accuracy threshold."""
        correct = 0
        total = 0

        with torch.no_grad():
            for batch in dataloader:
                inputs = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)
                labels = batch["label"].to(self.device)

                outputs = self.model(inputs, attention_mask)
                _, predicted = outputs.max(1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)

        accuracy = correct / total

        return TestResult(
            name="accuracy_threshold",
            passed=accuracy >= min_accuracy,
            message=f"Accuracy {accuracy:.4f} {'≥' if accuracy >= min_accuracy else '<'} {min_accuracy}",
            actual_value=accuracy,
            threshold=min_accuracy
        )

    def test_latency(
        self,
        sample_input: Dict[str, torch.Tensor],
        max_latency_ms: float,
        num_runs: int = 100
    ) -> TestResult:
        """Test inference latency."""
        # Warmup
        for _ in range(10):
            with torch.no_grad():
                self.model(
                    sample_input["input_ids"].to(self.device),
                    sample_input["attention_mask"].to(self.device)
                )

        # Measure
        latencies = []
        for _ in range(num_runs):
            start = time.perf_counter()
            with torch.no_grad():
                self.model(
                    sample_input["input_ids"].to(self.device),
                    sample_input["attention_mask"].to(self.device)
                )
            if self.device == "cuda":
                torch.cuda.synchronize()
            latencies.append((time.perf_counter() - start) * 1000)

        p95_latency = np.percentile(latencies, 95)

        return TestResult(
            name="latency_p95",
            passed=p95_latency <= max_latency_ms,
            message=f"P95 latency {p95_latency:.2f}ms {'≤' if p95_latency <= max_latency_ms else '>'} {max_latency_ms}ms",
            actual_value=p95_latency,
            threshold=max_latency_ms
        )

    def test_memory_usage(self, max_memory_gb: float) -> TestResult:
        """Test GPU memory usage."""
        if self.device != "cuda":
            return TestResult(
                name="memory_usage",
                passed=True,
                message="Memory test skipped (CPU mode)"
            )

        torch.cuda.reset_peak_memory_stats()

        # Run inference
        dummy_input = torch.randint(0, 1000, (1, 128)).to(self.device)
        dummy_mask = torch.ones(1, 128).to(self.device)
        with torch.no_grad():
            self.model(dummy_input, dummy_mask)

        peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3)

        return TestResult(
            name="memory_usage",
            passed=peak_memory_gb <= max_memory_gb,
            message=f"Memory {peak_memory_gb:.2f}GB {'≤' if peak_memory_gb <= max_memory_gb else '>'} {max_memory_gb}GB",
            actual_value=peak_memory_gb,
            threshold=max_memory_gb
        )

    def test_no_nan_outputs(self, sample_input: Dict[str, torch.Tensor]) -> TestResult:
        """Test that model doesn't produce NaN outputs."""
        with torch.no_grad():
            output = self.model(
                sample_input["input_ids"].to(self.device),
                sample_input["attention_mask"].to(self.device)
            )

        has_nan = torch.isnan(output).any().item()
        has_inf = torch.isinf(output).any().item()

        return TestResult(
            name="no_nan_outputs",
            passed=not has_nan and not has_inf,
            message="No NaN or Inf values" if not (has_nan or has_inf) else f"Found NaN={has_nan}, Inf={has_inf}"
        )

Running Model Tests
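
As with data validation, the report below comes from a small runner script. Here is a sketch of how the suite might be assembled; load_model, eval_loader, and sample_batch are stand-ins for project-specific loading code:

# Hypothetical runner for scripts/model_testing.py -- a sketch under the
# assumptions named above, not the exact script.
device = "cuda" if torch.cuda.is_available() else "cpu"
tester = ModelTester(load_model("outputs/model/"), device=device)

results = [
    tester.test_accuracy_threshold(eval_loader, min_accuracy=0.4),
    tester.test_latency(sample_batch, max_latency_ms=100),
    tester.test_memory_usage(max_memory_gb=8.0),
    tester.test_no_nan_outputs(sample_batch),
]

for r in results:
    print(f"  {'✓' if r.passed else '✗'} {r.name}: {r.message}")
if not all(r.passed for r in results):
    raise SystemExit("MODEL TESTS FAILED")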

Output
============================================================
MODEL TEST REPORT
============================================================
Tests: 5/5 passed

  ✓ accuracy_threshold: Accuracy 0.5625 >= 0.4
      Actual: 0.5625, Threshold: 0.4
  ✓ latency_p95: P95 latency 0.11ms <= 100ms
      Actual: 0.1125, Threshold: 100
  ✓ memory_usage: Memory test skipped (CPU mode)
  ✓ output_shape: Output shape (4, 2) matches expected
  ✓ no_nan_outputs: No NaN or Inf values in output

ALL TESTS PASSED

Regression Tests

Compare against a baseline model to catch regressions:

# Additional ModelTester method; _evaluate_accuracy is a small helper that
# runs the same accuracy loop as test_accuracy_threshold above.
def test_no_regression(
    self,
    new_model,
    baseline_model,
    dataloader,
    max_degradation: float = 0.02
) -> TestResult:
    """Test that new model isn't worse than baseline."""
    new_acc = self._evaluate_accuracy(new_model, dataloader)
    baseline_acc = self._evaluate_accuracy(baseline_model, dataloader)

    degradation = baseline_acc - new_acc

    return TestResult(
        name="regression_test",
        passed=degradation <= max_degradation,
        message=f"Degradation {degradation:.4f} {'≤' if degradation <= max_degradation else '>'} {max_degradation}",
        actual_value=degradation,
        threshold=max_degradation
    )
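
The baseline usually comes from a model registry. Since this pipeline already logs to MLflow, loading the current production model as the baseline might look like the sketch below; the registered model name is a stand-in:

import mlflow.pytorch

# Hypothetical: fetch the promoted production model as the regression
# baseline ("sentiment-classifier" is a stand-in registry name)
baseline_model = mlflow.pytorch.load_model("models:/sentiment-classifier/Production")
result = tester.test_no_regression(new_model, baseline_model, eval_loader)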

Part 3: GitHub Actions Workflow

Here’s the complete pipeline configuration:

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main, develop]
    paths:
      - 'src/**'
      - 'data/**'
      - 'tests/**'
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: '3.10'

jobs:
  # ========================================
  # Stage 1: Code Quality
  # ========================================
  code-quality:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: |
          pip install ruff mypy pytest
          pip install -r requirements.txt

      - name: Lint with ruff
        run: ruff check src/ tests/

      - name: Type check
        run: mypy src/ --ignore-missing-imports

      - name: Unit tests
        run: pytest tests/unit/ -v

  # ========================================
  # Stage 2: Data Validation
  # ========================================
  data-validation:
    name: Data Validation
    runs-on: ubuntu-latest
    needs: code-quality
    outputs:
      data_hash: ${{ steps.validate.outputs.data_hash }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: pip install pandas

      - name: Validate training data
        id: validate
        run: |
          python scripts/data_validation.py data/train.csv \
            --columns text label \
            --label label \
            --min-samples 1000 \
            --output validation_report.json

          DATA_HASH=$(jq -r '.dataset_hash' validation_report.json)
          echo "data_hash=$DATA_HASH" >> $GITHUB_OUTPUT

      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: data-validation-report
          path: validation_report.json

  # ========================================
  # Stage 3: Model Training (GPU)
  # ========================================
  train:
    name: Train Model
    runs-on: [self-hosted, gpu]
    needs: data-validation
    steps:
      - uses: actions/checkout@v4

      - name: Check GPU
        run: nvidia-smi

      - name: Train model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python src/train.py \
            --data data/train.csv \
            --epochs 3 \
            --output-dir outputs/

      - name: Upload model
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: outputs/model/

  # ========================================
  # Stage 4: Model Testing (GPU)
  # ========================================
  test-model:
    name: Test Model
    runs-on: [self-hosted, gpu]
    needs: train
    steps:
      - uses: actions/checkout@v4

      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: outputs/model/

      - name: Run model tests
        run: |
          python scripts/model_testing.py \
            --model-path outputs/model/ \
            --min-accuracy 0.80 \
            --max-latency-ms 50 \
            --output test_report.json

      - name: Check thresholds
        run: |
          PASSED=$(jq '.passed' test_report.json)
          if [ "$PASSED" != "true" ]; then
            echo "Model tests failed!"
            exit 1
          fi

  # ========================================
  # Stage 5: Deploy to Staging
  # ========================================
  deploy-staging:
    name: Deploy Staging
    runs-on: ubuntu-latest
    needs: test-model
    if: github.ref == 'refs/heads/develop'
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: outputs/model/

      - name: Deploy to SageMaker
        run: |
          python scripts/deploy_sagemaker.py \
            --model-path outputs/model/ \
            --endpoint sentiment-staging
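
      # (Suggested addition, not in the original workflow: a quick smoke
      # test against the freshly deployed staging endpoint, mirroring the
      # "Smoke tests" stage in the diagram. scripts/smoke_tests.py is a
      # hypothetical script.)
      - name: Smoke tests
        run: |
          python scripts/smoke_tests.py --endpoint sentiment-staging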

  # ========================================
  # Stage 6: Deploy to Production
  # ========================================
  deploy-production:
    name: Deploy Production
    runs-on: ubuntu-latest
    needs: test-model
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: outputs/model/

      - name: Blue-green deployment
        run: |
          python scripts/deploy_sagemaker.py \
            --model-path outputs/model/ \
            --endpoint sentiment-prod \
            --blue-green

      - name: Canary tests
        run: |
          python scripts/canary_tests.py \
            --endpoint sentiment-prod \
            --duration 300

Part 4: Self-Hosted GPU Runners

For GPU training in CI/CD, set up a self-hosted runner:

EC2 Runner Setup

# On your GPU instance (g5.xlarge or similar)

# 1. Install runner
mkdir actions-runner && cd actions-runner
curl -o actions-runner.tar.gz -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf actions-runner.tar.gz

# 2. Configure with GPU labels (get token from GitHub repo settings)
#    config.sh is run once; pass the labels in the same invocation
./config.sh --url https://github.com/YOUR_ORG/YOUR_REPO \
  --token YOUR_TOKEN \
  --labels gpu,cuda

# 4. Install as service
sudo ./svc.sh install
sudo ./svc.sh start

Runner Labels

Use labels to route jobs to appropriate runners:

jobs:
  train:
    runs-on: [self-hosted, gpu]  # Routes to GPU runner

  test:
    runs-on: ubuntu-latest  # Uses GitHub-hosted runner

Part 5: Deployment Strategies

Blue-Green Deployment

Deploy new version alongside old, then switch traffic:

# scripts/deploy_sagemaker.py
import boto3
import time

def blue_green_deploy(model_path: str, endpoint_name: str):
    sm = boto3.client('sagemaker')

    # 1. Create new model
    new_model_name = f"{endpoint_name}-{int(time.time())}"
    sm.create_model(
        ModelName=new_model_name,
        PrimaryContainer={
            # Use the full ECR image URI for your region in practice
            'Image': 'pytorch-inference:2.0-gpu',
            'ModelDataUrl': f's3://bucket/{model_path}'
        },
        ExecutionRoleArn='arn:aws:iam::...'
    )

    # 2. Create new endpoint config
    new_config_name = f"{endpoint_name}-config-{int(time.time())}"
    sm.create_endpoint_config(
        EndpointConfigName=new_config_name,
        ProductionVariants=[{
            'VariantName': 'AllTraffic',
            'ModelName': new_model_name,
            'InstanceType': 'ml.g5.xlarge',
            'InitialInstanceCount': 1
        }]
    )

    # 3. Update endpoint (zero-downtime)
    sm.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name
    )

    # 4. Wait for deployment
    waiter = sm.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name)

    print(f"Deployed {new_model_name} to {endpoint_name}")

Canary Testing

Test new deployment with a subset of traffic:

# scripts/canary_tests.py
import json
import time

import boto3

def canary_test(endpoint_name: str, duration_seconds: int = 300, error_threshold: float = 0.01):
    """Run canary tests against deployed endpoint."""
    sm_runtime = boto3.client('sagemaker-runtime')

    # load_test_cases and validate_response are project-specific helpers
    test_cases = load_test_cases()
    errors = 0
    total = 0

    start_time = time.time()
    while time.time() - start_time < duration_seconds:
        for test in test_cases:
            try:
                response = sm_runtime.invoke_endpoint(
                    EndpointName=endpoint_name,
                    ContentType='application/json',
                    Body=json.dumps(test['input'])
                )
                result = json.loads(response['Body'].read())

                # Validate response
                if not validate_response(result, test['expected']):
                    errors += 1

            except Exception as e:
                errors += 1
                print(f"Error: {e}")

            total += 1

        time.sleep(1)

    error_rate = errors / total
    print(f"Canary test: {errors}/{total} errors ({error_rate:.2%})")

    if error_rate > error_threshold:
        raise Exception(f"Error rate {error_rate:.2%} exceeds threshold {error_threshold:.2%}")
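
If the canary fails, the endpoint should be rolled back to the last known-good configuration. A minimal sketch, assuming the previous endpoint config name was recorded (for example via describe_endpoint) before the blue-green update:

def rollback(endpoint_name: str, previous_config_name: str) -> None:
    """Re-point the endpoint at the last known-good config."""
    sm = boto3.client('sagemaker')
    sm.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=previous_config_name
    )
    sm.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)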

Part 6: Automated Retraining

Trigger retraining when performance degrades:

# .github/workflows/scheduled-retrain.yml
name: Scheduled Retrain

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:
    inputs:
      force_retrain:
        description: 'Force retraining even if metrics are good'
        default: 'false'

jobs:
  check-metrics:
    runs-on: ubuntu-latest
    outputs:
      should_retrain: ${{ steps.check.outputs.should_retrain }}
    steps:
      - name: Check production metrics
        id: check
        run: |
          # Query CloudWatch for model accuracy
          ACCURACY=$(aws cloudwatch get-metric-data ...)

          if [ $(echo "$ACCURACY < 0.80" | bc) -eq 1 ]; then
            echo "should_retrain=true" >> $GITHUB_OUTPUT
          else
            echo "should_retrain=false" >> $GITHUB_OUTPUT
          fi

  retrain:
    needs: check-metrics
    if: needs.check-metrics.outputs.should_retrain == 'true' || github.event.inputs.force_retrain == 'true'
    # Note: for this to work, ml-pipeline.yml must also declare an
    # "on: workflow_call" trigger.
    uses: ./.github/workflows/ml-pipeline.yml

Best Practices

1. Fail Fast

Put cheap checks first:

jobs:
  lint:          # 30 seconds
    ...
  unit-tests:    # 2 minutes
    needs: lint
  data-validate: # 1 minute
    needs: unit-tests
  train:         # 30 minutes
    needs: data-validate

2. Cache Aggressively

- name: Cache pip packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: pip-${{ hashFiles('requirements.txt') }}

- name: Cache model weights
  uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface
    key: hf-${{ hashFiles('src/config.py') }}

3. Artifact Management

- name: Upload artifacts
  uses: actions/upload-artifact@v4
  with:
    name: model-${{ github.sha }}
    path: outputs/
    retention-days: 30

4. Secrets Management

Never hardcode credentials:

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
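
For AWS specifically, a common pattern is the official credentials action, ideally with OIDC role assumption instead of long-lived keys. A sketch; the role secret name is an assumption:

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: us-east-1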

5. Environment Protection

Require approvals for production:

deploy-production:
  environment:
    name: production
    url: https://api.example.com
  # Requires manual approval in GitHub UI

Full Code

All code from this tutorial is available in the companion repository.

Key Takeaways

  1. Validate data first - Bad data causes most ML failures
  2. Test models, not just code - Accuracy, latency, memory thresholds
  3. Use self-hosted GPU runners - Standard GitHub-hosted runners don't include GPUs
  4. Deploy incrementally - Blue-green + canary catches issues early
  5. Automate retraining - Trigger on metric degradation
  6. Cache everything - Model weights, pip packages, data

What’s Next

This tutorial is part of the Senior MLE Guide series:

  1. GPU Sizing for ML Workloads
  2. Experiment Tracking with MLflow & Langfuse
  3. CI/CD for Machine Learning ← You are here
  4. Model Serving on AWS (coming soon)
  5. ML Monitoring & Drift Detection (coming soon)
  6. Security for ML Systems (coming soon)