CI/CD for Machine Learning
Build a complete ML pipeline with GitHub Actions: data validation, model training, automated testing, and staged deployment to production.
ML CI/CD is different from software CI/CD. You’re not just testing code—you’re testing data, models, and the interaction between them. A model that passes unit tests can still fail in production if the data distribution shifts or performance degrades below acceptable thresholds.
This tutorial builds a complete ML pipeline with GitHub Actions that validates data, trains models, runs performance tests, and deploys to production with blue-green rollouts.
The ML Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ ML CI/CD PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Code │ │ Data │ │ Model │ │ Model │ │
│ │ Quality │──▶│Validation│──▶│ Training │──▶│ Testing │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ - Lint - Schema - Train - Accuracy │
│ - Types - Nulls - Log to - Latency │
│ - Unit tests - Ranges - MLflow - Memory │
│ - Balance - Regression │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Integration│──▶│ Deploy │──▶│ Deploy │ │
│ │ Tests │ │ Staging │ │Production│ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ - End-to-end - Smoke - Blue-green │
│ - API tests - tests - Canary │
│ - Rollback │
│ │
└─────────────────────────────────────────────────────────────────────┘
Each stage gates the next. If data validation fails, training doesn’t run. If model tests fail, deployment is blocked.
Part 1: Data Validation
Data issues are the #1 cause of ML failures in production. Catch them early with automated validation.
The Data Validator
# scripts/data_validation.py
import json
import sys
import hashlib
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from pathlib import Path


@dataclass
class ValidationResult:
    """Result of a validation check."""
    name: str
    passed: bool
    message: str
    details: Optional[Dict[str, Any]] = None


@dataclass
class DataValidationReport:
    """Complete validation report."""
    dataset_path: str
    dataset_hash: str
    total_checks: int
    passed_checks: int
    failed_checks: int
    results: List[ValidationResult]

    @property
    def passed(self) -> bool:
        return self.failed_checks == 0


class DataValidator:
    """Validates ML datasets."""

    def validate_schema(self, df, expected_columns: List[str]) -> ValidationResult:
        """Check that expected columns exist."""
        missing = set(expected_columns) - set(df.columns)
        if missing:
            return ValidationResult(
                name="schema_validation",
                passed=False,
                message=f"Missing columns: {missing}",
                details={"missing": list(missing)}
            )
        return ValidationResult(
            name="schema_validation",
            passed=True,
            message=f"All {len(expected_columns)} expected columns present"
        )

    def validate_no_nulls(self, df, columns: List[str]) -> ValidationResult:
        """Check for null values in specified columns."""
        null_counts = df[columns].isnull().sum()
        columns_with_nulls = null_counts[null_counts > 0]
        if len(columns_with_nulls) > 0:
            return ValidationResult(
                name="null_check",
                passed=False,
                message=f"Found nulls in {len(columns_with_nulls)} columns",
                details={"null_counts": columns_with_nulls.to_dict()}
            )
        return ValidationResult(
            name="null_check",
            passed=True,
            message="No null values found in required columns"
        )

    def validate_class_balance(self, df, label_column: str, max_imbalance: float = 0.8) -> ValidationResult:
        """Check that classes aren't too imbalanced."""
        class_counts = df[label_column].value_counts(normalize=True)
        max_ratio = class_counts.max()
        if max_ratio > max_imbalance:
            return ValidationResult(
                name="class_balance",
                passed=False,
                message=f"Class imbalance: {max_ratio:.2%} in majority class",
                details={"distribution": class_counts.to_dict()}
            )
        return ValidationResult(
            name="class_balance",
            passed=True,
            message=f"Class balance acceptable (max {max_ratio:.2%})"
        )

    def validate_min_samples(self, df, min_samples: int) -> ValidationResult:
        """Check minimum sample count."""
        actual = len(df)
        if actual < min_samples:
            return ValidationResult(
                name="min_samples",
                passed=False,
                message=f"Insufficient samples: {actual} < {min_samples}"
            )
        return ValidationResult(
            name="min_samples",
            passed=True,
            message=f"Sample count OK: {actual} >= {min_samples}"
        )
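The pipeline diagram also lists a ranges check alongside schema and nulls, but it isn't implemented above. A minimal standalone sketch (the function name, signature, and `(passed, message)` return shape are assumptions, not taken from the tutorial's script):

```python
from typing import Dict, Tuple


def validate_value_ranges(df, ranges: Dict[str, Tuple[float, float]]):
    """Check that numeric columns fall within expected (min, max) bounds.

    Returns a (passed, message) pair; in the real script this would be
    wrapped in a ValidationResult like the other checks.
    """
    violations = {}
    for column, (lo, hi) in ranges.items():
        bad = df[(df[column] < lo) | (df[column] > hi)]
        if len(bad) > 0:
            violations[column] = len(bad)
    if violations:
        return False, f"Out-of-range values: {violations}"
    return True, "All values within expected ranges"
```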
Running Data Validation
============================================================
DATA VALIDATION REPORT
============================================================
Dataset: data/train.csv
Hash:    a0b908d5959bc22c
Checks:  4/4 passed

✓ schema_validation: All 2 expected columns present
✓ null_check: No null values found in required columns
✓ min_samples: Sample count OK: 1000 >= 500
✓ class_balance: Class balance acceptable (max 51.00%)

VALIDATION PASSED
The data hash (a0b908d5959bc22c) enables caching—if data hasn’t changed, we can skip re-training.
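The tutorial doesn't show how that hash is computed; one reasonable sketch hashes the raw file bytes and truncates (the function name and the 16-character truncation are assumptions chosen to match the report above):

```python
import hashlib


def dataset_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 over the raw file bytes, truncated to a short cache key."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:16]
```

The same key can feed a cache step so an unchanged dataset skips the expensive training job.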
Part 2: Model Testing
Model tests verify that the model meets performance requirements before deployment.
The Model Tester
# scripts/model_testing.py
import time
import torch
import numpy as np
from dataclasses import dataclass
from typing import Dict, Any, Optional, List


@dataclass
class TestResult:
    """Result of a single test."""
    name: str
    passed: bool
    message: str
    actual_value: Optional[float] = None
    threshold: Optional[float] = None


class ModelTester:
    """Tests ML models against requirements."""

    def __init__(self, model, device: str = "cuda"):
        self.model = model
        self.device = device
        self.model.to(device)
        self.model.eval()

    def test_accuracy_threshold(
        self,
        dataloader,
        min_accuracy: float
    ) -> TestResult:
        """Test that model meets accuracy threshold."""
        correct = 0
        total = 0
        with torch.no_grad():
            for batch in dataloader:
                inputs = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)
                labels = batch["label"].to(self.device)
                outputs = self.model(inputs, attention_mask)
                _, predicted = outputs.max(1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)
        accuracy = correct / total
        return TestResult(
            name="accuracy_threshold",
            passed=accuracy >= min_accuracy,
            message=f"Accuracy {accuracy:.4f} {'≥' if accuracy >= min_accuracy else '<'} {min_accuracy}",
            actual_value=accuracy,
            threshold=min_accuracy
        )

    def test_latency(
        self,
        sample_input: Dict[str, torch.Tensor],
        max_latency_ms: float,
        num_runs: int = 100
    ) -> TestResult:
        """Test inference latency."""
        # Warmup
        for _ in range(10):
            with torch.no_grad():
                self.model(
                    sample_input["input_ids"].to(self.device),
                    sample_input["attention_mask"].to(self.device)
                )
        if self.device == "cuda":
            torch.cuda.synchronize()  # finish warmup kernels before timing starts
        # Measure
        latencies = []
        for _ in range(num_runs):
            start = time.perf_counter()
            with torch.no_grad():
                self.model(
                    sample_input["input_ids"].to(self.device),
                    sample_input["attention_mask"].to(self.device)
                )
            if self.device == "cuda":
                torch.cuda.synchronize()
            latencies.append((time.perf_counter() - start) * 1000)
        p95_latency = np.percentile(latencies, 95)
        return TestResult(
            name="latency_p95",
            passed=p95_latency <= max_latency_ms,
            message=f"P95 latency {p95_latency:.2f}ms {'≤' if p95_latency <= max_latency_ms else '>'} {max_latency_ms}ms",
            actual_value=p95_latency,
            threshold=max_latency_ms
        )

    def test_memory_usage(self, max_memory_gb: float) -> TestResult:
        """Test GPU memory usage."""
        if self.device != "cuda":
            return TestResult(
                name="memory_usage",
                passed=True,
                message="Memory test skipped (CPU mode)"
            )
        torch.cuda.reset_peak_memory_stats()
        # Run inference
        dummy_input = torch.randint(0, 1000, (1, 128)).to(self.device)
        dummy_mask = torch.ones(1, 128).to(self.device)
        with torch.no_grad():
            self.model(dummy_input, dummy_mask)
        peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
        return TestResult(
            name="memory_usage",
            passed=peak_memory_gb <= max_memory_gb,
            message=f"Memory {peak_memory_gb:.2f}GB {'≤' if peak_memory_gb <= max_memory_gb else '>'} {max_memory_gb}GB",
            actual_value=peak_memory_gb,
            threshold=max_memory_gb
        )

    def test_no_nan_outputs(self, sample_input: Dict[str, torch.Tensor]) -> TestResult:
        """Test that model doesn't produce NaN outputs."""
        with torch.no_grad():
            output = self.model(
                sample_input["input_ids"].to(self.device),
                sample_input["attention_mask"].to(self.device)
            )
        has_nan = torch.isnan(output).any().item()
        has_inf = torch.isinf(output).any().item()
        return TestResult(
            name="no_nan_outputs",
            passed=not has_nan and not has_inf,
            message="No NaN or Inf values" if not (has_nan or has_inf) else f"Found NaN={has_nan}, Inf={has_inf}"
        )
Running Model Tests
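The report below can be produced by a small summarizer over TestResult objects. This sketch repeats the dataclass for self-containment; the formatting details are assumptions, since the tutorial's own reporting code isn't shown:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TestResult:
    name: str
    passed: bool
    message: str
    actual_value: Optional[float] = None
    threshold: Optional[float] = None


def summarize(results: List[TestResult]) -> str:
    """Render a per-test pass/fail report and an overall verdict."""
    passed = sum(r.passed for r in results)
    lines = [f"Tests: {passed}/{len(results)} passed"]
    for r in results:
        lines.append(f"{'✓' if r.passed else '✗'} {r.name}: {r.message}")
    lines.append("ALL TESTS PASSED" if passed == len(results) else "TESTS FAILED")
    return "\n".join(lines)
```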
============================================================
MODEL TEST REPORT
============================================================
Tests: 5/5 passed

✓ accuracy_threshold: Accuracy 0.5625 >= 0.4
    Actual: 0.5625, Threshold: 0.4
✓ latency_p95: P95 latency 0.11ms <= 100ms
    Actual: 0.1125, Threshold: 100
✓ memory_usage: Memory test skipped (CPU mode)
✓ output_shape: Output shape (4, 2) matches expected
✓ no_nan_outputs: No NaN or Inf values in output

ALL TESTS PASSED
Regression Tests
Compare against a baseline model to catch regressions:
def test_no_regression(
    self,
    new_model,
    baseline_model,
    dataloader,
    max_degradation: float = 0.02
) -> TestResult:
    """Test that new model isn't worse than baseline."""
    new_acc = self._evaluate_accuracy(new_model, dataloader)
    baseline_acc = self._evaluate_accuracy(baseline_model, dataloader)
    degradation = baseline_acc - new_acc
    return TestResult(
        name="regression_test",
        passed=degradation <= max_degradation,
        message=f"Degradation {degradation:.4f} {'≤' if degradation <= max_degradation else '>'} {max_degradation}",
        actual_value=degradation,
        threshold=max_degradation
    )
Part 3: GitHub Actions Workflow
Here’s the complete pipeline configuration:
# .github/workflows/ml-pipeline.yml
name: ML Pipeline
on:
  push:
    branches: [main, develop]
    paths:
      - 'src/**'
      - 'data/**'
      - 'tests/**'
  pull_request:
    branches: [main]

env:
  PYTHON_VERSION: '3.10'

jobs:
  # ========================================
  # Stage 1: Code Quality
  # ========================================
  code-quality:
    name: Code Quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install dependencies
        run: |
          pip install ruff mypy pytest
          pip install -r requirements.txt
      - name: Lint with ruff
        run: ruff check src/ tests/
      - name: Type check
        run: mypy src/ --ignore-missing-imports
      - name: Unit tests
        run: pytest tests/unit/ -v
  # ========================================
  # Stage 2: Data Validation
  # ========================================
  data-validation:
    name: Data Validation
    runs-on: ubuntu-latest
    needs: code-quality
    outputs:
      data_hash: ${{ steps.validate.outputs.data_hash }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - name: Install dependencies
        run: pip install pandas
      - name: Validate training data
        id: validate
        run: |
          python scripts/data_validation.py data/train.csv \
            --columns text label \
            --label label \
            --min-samples 1000 \
            --output validation_report.json
          DATA_HASH=$(cat validation_report.json | jq -r '.dataset_hash')
          echo "data_hash=$DATA_HASH" >> $GITHUB_OUTPUT
      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: data-validation-report
          path: validation_report.json
  # ========================================
  # Stage 3: Model Training (GPU)
  # ========================================
  train:
    name: Train Model
    runs-on: [self-hosted, gpu]
    needs: data-validation
    steps:
      - uses: actions/checkout@v4
      - name: Check GPU
        run: nvidia-smi
      - name: Train model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python src/train.py \
            --data data/train.csv \
            --epochs 3 \
            --output-dir outputs/
      - name: Upload model
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: outputs/model/
  # ========================================
  # Stage 4: Model Testing (GPU)
  # ========================================
  test-model:
    name: Test Model
    runs-on: [self-hosted, gpu]
    needs: train
    steps:
      - uses: actions/checkout@v4
      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: outputs/model/
      - name: Run model tests
        run: |
          python scripts/model_testing.py \
            --model-path outputs/model/ \
            --min-accuracy 0.80 \
            --max-latency-ms 50 \
            --output test_report.json
      - name: Check thresholds
        run: |
          PASSED=$(cat test_report.json | jq '.passed')
          if [ "$PASSED" != "true" ]; then
            echo "Model tests failed!"
            exit 1
          fi
  # ========================================
  # Stage 5: Deploy to Staging
  # ========================================
  deploy-staging:
    name: Deploy Staging
    runs-on: ubuntu-latest
    needs: test-model
    if: github.ref == 'refs/heads/develop'
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: outputs/model/
      - name: Deploy to SageMaker
        run: |
          python scripts/deploy_sagemaker.py \
            --model-path outputs/model/ \
            --endpoint sentiment-staging
  # ========================================
  # Stage 6: Deploy to Production
  # ========================================
  deploy-production:
    name: Deploy Production
    runs-on: ubuntu-latest
    needs: test-model
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Download model
        uses: actions/download-artifact@v4
        with:
          name: trained-model
          path: outputs/model/
      - name: Blue-green deployment
        run: |
          python scripts/deploy_sagemaker.py \
            --model-path outputs/model/ \
            --endpoint sentiment-prod \
            --blue-green
      - name: Canary tests
        run: |
          python scripts/canary_tests.py \
            --endpoint sentiment-prod \
            --duration 300
Part 4: Self-Hosted GPU Runners
For GPU training in CI/CD, set up a self-hosted runner:
EC2 Runner Setup
# On your GPU instance (g5.xlarge or similar)
# 1. Install runner
mkdir actions-runner && cd actions-runner
curl -o actions-runner.tar.gz -L https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
tar xzf actions-runner.tar.gz
# 2. Configure with GPU labels so jobs can target this runner
#    (get the token from your repo's Settings → Actions → Runners)
./config.sh --url https://github.com/YOUR_ORG/YOUR_REPO --token YOUR_TOKEN \
  --labels gpu,cuda
# 3. Install as service
sudo ./svc.sh install
sudo ./svc.sh start
Runner Labels
Use labels to route jobs to appropriate runners:
jobs:
  train:
    runs-on: [self-hosted, gpu]   # Routes to GPU runner
  test:
    runs-on: ubuntu-latest        # Uses GitHub-hosted runner
Part 5: Deployment Strategies
Blue-Green Deployment
Deploy new version alongside old, then switch traffic:
# scripts/deploy_sagemaker.py
import boto3
import time


def blue_green_deploy(model_path: str, endpoint_name: str):
    sm = boto3.client('sagemaker')

    # 1. Create new model
    new_model_name = f"{endpoint_name}-{int(time.time())}"
    sm.create_model(
        ModelName=new_model_name,
        PrimaryContainer={
            'Image': 'pytorch-inference:2.0-gpu',
            'ModelDataUrl': f's3://bucket/{model_path}'
        },
        ExecutionRoleArn='arn:aws:iam::...'
    )

    # 2. Create new endpoint config
    new_config_name = f"{endpoint_name}-config-{int(time.time())}"
    sm.create_endpoint_config(
        EndpointConfigName=new_config_name,
        ProductionVariants=[{
            'VariantName': 'AllTraffic',
            'ModelName': new_model_name,
            'InstanceType': 'ml.g5.xlarge',
            'InitialInstanceCount': 1
        }]
    )

    # 3. Update endpoint (zero-downtime)
    sm.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name
    )

    # 4. Wait for deployment
    waiter = sm.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name)
    print(f"Deployed {new_model_name} to {endpoint_name}")
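The pipeline diagram lists rollback as a production concern, but the deploy script doesn't show it. A sketch under stated assumptions: the previous endpoint-config name is kept around, and the SageMaker client is passed in so the function can be exercised without AWS credentials (this function and its signature are not from the tutorial's script):

```python
def rollback(sm, endpoint_name: str, previous_config_name: str) -> str:
    """Point the endpoint back at the previous config, e.g. when canary tests fail.

    `sm` is a boto3 SageMaker client; update_endpoint swaps configs with the
    same zero-downtime mechanics used for the forward deploy.
    """
    sm.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=previous_config_name,
    )
    sm.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
    return previous_config_name
```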
Canary Testing
Test new deployment with a subset of traffic:
import json
import time

import boto3


def canary_test(endpoint_name: str, duration_seconds: int = 300, error_threshold: float = 0.01):
    """Run canary tests against deployed endpoint."""
    sm_runtime = boto3.client('sagemaker-runtime')
    test_cases = load_test_cases()
    errors = 0
    total = 0
    start_time = time.time()
    while time.time() - start_time < duration_seconds:
        for test in test_cases:
            try:
                response = sm_runtime.invoke_endpoint(
                    EndpointName=endpoint_name,
                    ContentType='application/json',
                    Body=json.dumps(test['input'])
                )
                result = json.loads(response['Body'].read())
                # Validate response
                if not validate_response(result, test['expected']):
                    errors += 1
            except Exception as e:
                errors += 1
                print(f"Error: {e}")
            total += 1
        time.sleep(1)
    error_rate = errors / total if total else 0.0
    print(f"Canary test: {errors}/{total} errors ({error_rate:.2%})")
    if error_rate > error_threshold:
        raise Exception(f"Error rate {error_rate:.2%} exceeds threshold {error_threshold:.2%}")
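canary_test relies on load_test_cases and validate_response, which aren't shown. Hedged sketches follow; the JSON file layout and the response fields (label, confidence) are assumptions about what the endpoint returns:

```python
import json
from pathlib import Path


def load_test_cases(path: str = "tests/canary_cases.json"):
    """Load a JSON list of {"input": ..., "expected": ...} test cases."""
    return json.loads(Path(path).read_text())


def validate_response(result: dict, expected: dict) -> bool:
    """Loose check: predicted label matches and a numeric confidence is present."""
    return (
        result.get("label") == expected.get("label")
        and isinstance(result.get("confidence"), (int, float))
    )
```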
Part 6: Automated Retraining
Trigger retraining when performance degrades:
# .github/workflows/scheduled-retrain.yml
name: Scheduled Retrain

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:
    inputs:
      force_retrain:
        description: 'Force retraining even if metrics are good'
        default: 'false'

jobs:
  check-metrics:
    runs-on: ubuntu-latest
    outputs:
      should_retrain: ${{ steps.check.outputs.should_retrain }}
    steps:
      - name: Check production metrics
        id: check
        run: |
          # Query CloudWatch for model accuracy
          ACCURACY=$(aws cloudwatch get-metric-data ...)
          if [ $(echo "$ACCURACY < 0.80" | bc) -eq 1 ]; then
            echo "should_retrain=true" >> $GITHUB_OUTPUT
          else
            echo "should_retrain=false" >> $GITHUB_OUTPUT
          fi

  retrain:
    needs: check-metrics
    if: needs.check-metrics.outputs.should_retrain == 'true' || github.event.inputs.force_retrain == 'true'
    # Note: ml-pipeline.yml must also declare an `on: workflow_call`
    # trigger before it can be reused like this
    uses: ./.github/workflows/ml-pipeline.yml
Best Practices
1. Fail Fast
Put cheap checks first:
jobs:
  lint:            # 30 seconds
    ...
  unit-tests:      # 2 minutes
    needs: lint
  data-validate:   # 1 minute
    needs: unit-tests
  train:           # 30 minutes
    needs: data-validate
2. Cache Aggressively
- name: Cache pip packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: pip-${{ hashFiles('requirements.txt') }}

- name: Cache model weights
  uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface
    key: hf-${{ hashFiles('src/config.py') }}
3. Artifact Management
- name: Upload artifacts
  uses: actions/upload-artifact@v4
  with:
    name: model-${{ github.sha }}
    path: outputs/
    retention-days: 30
4. Secrets Management
Never hardcode credentials:
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
5. Environment Protection
Require approvals for production:
deploy-production:
  environment:
    name: production
    url: https://api.example.com
  # Requires manual approval in GitHub UI
Full Code
All code from this tutorial is available at:
Key Takeaways
- Validate data first - Bad data causes most ML failures
- Test models, not just code - Accuracy, latency, memory thresholds
- Use self-hosted GPU runners - GitHub-hosted runners don’t have GPUs
- Deploy incrementally - Blue-green + canary catches issues early
- Automate retraining - Trigger on metric degradation
- Cache everything - Model weights, pip packages, data
What’s Next
This tutorial is part of the Senior MLE Guide series:
- GPU Sizing for ML Workloads
- Experiment Tracking with MLflow & Langfuse
- CI/CD for Machine Learning ← You are here
- Model Serving on AWS (coming soon)
- ML Monitoring & Drift Detection (coming soon)
- Security for ML Systems (coming soon)