Advanced

MLOps and Model Management

Lesson 4 of 4 · Estimated time: 60 min

Production ML systems require experiment tracking, model versioning, continuous integration, and monitoring. This lesson covers Weights & Biases, MLflow, model registries, and CI/CD pipelines for ML.

Core Concepts

Experiment Tracking

Log hyperparameters, metrics, and artifacts so every run can be reproduced and compared:

Experiment 1: lr=2e-5, batch_size=32 → val_loss=2.15, accuracy=0.78
Experiment 2: lr=1e-5, batch_size=32 → val_loss=2.08, accuracy=0.79
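At its core, a tracker is just an append-only log of runs. A minimal file-based sketch using JSON lines (the `log_run` helper and file layout are illustrative, not a real tracking API; dedicated tools add UI, artifact storage, and collaboration on top of this idea):

```python
import json
import os
import tempfile

def log_run(path, params, metrics):
    """Append one experiment run as a JSON line (hypothetical helper)."""
    with open(path, "a") as f:
        f.write(json.dumps({"params": params, "metrics": metrics}) + "\n")

path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_run(path, {"lr": 2e-5, "batch_size": 32}, {"val_loss": 2.15, "accuracy": 0.78})
log_run(path, {"lr": 1e-5, "batch_size": 32}, {"val_loss": 2.08, "accuracy": 0.79})

# Reload all runs and pick the best by validation loss
with open(path) as f:
    runs = [json.loads(line) for line in f]
best = min(runs, key=lambda r: r["metrics"]["val_loss"])
print(best["params"])  # the lr=1e-5 run has the lower val_loss
```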

Model Versioning

Track model versions with metadata:

Model v1.0: gpt2-base, val_loss=2.15
Model v1.1: gpt2-base + LoRA, val_loss=2.08
Model v2.0: gpt2-medium, val_loss=1.95
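Version records like these are just structured metadata. A minimal sketch modeling them as plain data (the `ModelVersion` fields are illustrative; registries such as MLflow's, shown later in this lesson, persist the same information):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    base_model: str
    val_loss: float

versions = [
    ModelVersion("classifier", "1.0", "gpt2-base", 2.15),
    ModelVersion("classifier", "1.1", "gpt2-base + LoRA", 2.08),
    ModelVersion("classifier", "2.0", "gpt2-medium", 1.95),
]

# Select the best version by validation loss
best = min(versions, key=lambda v: v.val_loss)
```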

Model Registry

Central repository for production models:

Registry:
  - Model A v1.0 (dev)
  - Model A v1.1 (staging)
  - Model A v1.0 (production)
  - Model B v2.0 (canary, 10% traffic)

CI/CD for ML

Automate testing, evaluation, and deployment:

Push code → Run tests → Train model → Evaluate → Deploy if passing
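The "Deploy if passing" step is typically a small gate script whose exit code controls the pipeline: a non-zero exit fails the CI job and blocks deployment. A minimal sketch (metric names and thresholds are illustrative):

```python
import sys

def evaluation_gate(metrics, min_accuracy=0.75, max_val_loss=2.2):
    """Return True if the candidate model clears the deployment thresholds."""
    return (metrics["accuracy"] >= min_accuracy
            and metrics["val_loss"] <= max_val_loss)

if __name__ == "__main__":
    # In a real pipeline these would be read from the evaluation run's output
    metrics = {"accuracy": 0.79, "val_loss": 2.08}
    if not evaluation_gate(metrics):
        sys.exit(1)  # non-zero exit fails the CI job, blocking the deploy step
```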

Practical Implementation

Weights & Biases Tracking

import wandb
from transformers import TrainingArguments, Trainer

# Initialize project
wandb.init(project="llm-training", entity="my-org")

# Log config
wandb.config.update({
    "model": "gpt2",
    "learning_rate": 2e-5,
    "batch_size": 32,
})

# Training arguments with W&B integration
training_args = TrainingArguments(
    output_dir="./results",
    report_to=["wandb"],
    logging_dir="./logs",
    num_train_epochs=3,
)

# Trainer logs to W&B automatically via report_to
# (model and train_dataset are assumed to be defined elsewhere)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()

# Log final metrics
wandb.log({
    "final_accuracy": 0.95,
    "final_loss": 0.15,
})

wandb.finish()

MLflow Experiment Management

import mlflow
from mlflow.models import infer_signature

# Start experiment
with mlflow.start_run(run_name="gpt2-training"):
    # Log params
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 32)
    
    # Train model...
    
    # Log metrics
    mlflow.log_metric("train_loss", 2.1)
    mlflow.log_metric("val_accuracy", 0.95)
    
    # Infer a signature from sample inputs/outputs, then log the model once
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.pytorch.log_model(model, "model", signature=signature)

Model Registry

# Register model
result = mlflow.register_model("runs:/abc123/model", "GPT2-classifier")

# Transition to production (newer MLflow versions favor version aliases over stages)
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="GPT2-classifier",
    version=result.version,
    stage="Production",
)

# Load production model
prod_model = mlflow.pyfunc.load_model("models:/GPT2-classifier/Production")

CI/CD Pipeline

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on: [push]

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v3
      
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Run tests
        run: pytest tests/
      
      - name: Train model
        run: python train.py
      
      - name: Evaluate model
        run: python evaluate.py
      
      - name: Deploy if tests pass
        if: success()
        run: python deploy.py
      
      - name: Log to MLflow
        run: python log_metrics.py

Advanced Techniques

A/B Testing Infrastructure

import numpy as np

class ABTestManager:
    def __init__(self, registry):
        self.registry = registry
        self.traffic_split = {"model_a": 0.9, "model_b": 0.1}
    
    def route_request(self, request):
        model_choice = np.random.choice(
            list(self.traffic_split.keys()),
            p=list(self.traffic_split.values())
        )
        
        model = self.registry.load(model_choice)
        output = model.predict(request)
        
        return output, {"model": model_choice}

Monitoring and Alerts

import prometheus_client

class ModelMonitor:
    def __init__(self, model):
        self.model = model  # the model being served
        self.request_count = prometheus_client.Counter("requests", "Total requests")
        self.latency = prometheus_client.Histogram("latency_seconds", "Request latency")
        self.accuracy = prometheus_client.Gauge("accuracy", "Model accuracy")
    
    def predict(self, input_data):
        with self.latency.time():
            output = self.model.predict(input_data)
        
        self.request_count.inc()
        return output
    
    def update_accuracy(self, metrics):
        self.accuracy.set(metrics["accuracy"])

Production Considerations

Model Serving Architecture

Client requests → Load Balancer → Model Serving Container
                                  ├─ Model A (80%)
                                  └─ Model B (20% canary)

All metrics → Monitoring System (Prometheus)
            → Alerting System (PagerDuty)
            → Logging System (ELK Stack)

Key Takeaway

MLOps infrastructure tracks experiments, versions models, and automates deployment. Implement early with W&B or MLflow to enable reproducibility, collaboration, and safe production rollouts.

Practical Exercise

Task: Set up complete MLOps pipeline with experiment tracking and CI/CD.

Requirements:

  1. Initialize W&B or MLflow project
  2. Log experiments with metrics
  3. Register best model to registry
  4. Create GitHub Actions CI/CD
  5. Set up monitoring and alerting

Evaluation:

  • Experiment reproducibility
  • Successful CI/CD execution
  • Model versioning and tracking
  • Alert system validation
  • Team collaboration workflow