Prompt Versioning and Lifecycle Management
Introduction
Here’s a situation you’ll face in production: your chatbot’s system prompt is working well, but a new model version comes out with better instruction-following. You update the prompt to take advantage of the new capabilities. But what if the update breaks something? You have no way to roll back. You don’t know when the problem started or which prompt version caused it.
This lesson teaches you how to treat prompts like code: version them, track changes, test before deployment, and monitor production performance. Without these practices, you’ll spend hours debugging “why did this suddenly start giving wrong answers?” without any way to find out.
Key Takeaway: Prompts are code. They should be version controlled, tested in CI/CD pipelines, and monitored in production. A prompt change that breaks 5% of requests might not be caught in manual testing but will be revealed by production metrics.
Why Prompt Versioning Matters
Let’s start with concrete pain points you’ll experience without versioning:
Problem 1: Unknown Regression
You deploy a prompt update. A week later, users complain about lower response quality. You ask: “Which prompt are we running?” Nobody knows exactly. You have a vague idea it was updated but no way to roll back.
Problem 2: Experimentation Gone Wrong
You test 5 different prompt variations. One gets deployed to production by mistake instead of the one you intended. Now it’s live for hours before anyone notices.
Problem 3: Multi-Team Confusion
Your team has 12 different prompts in use. Different developers are updating them independently. Nobody knows which versions are in which environments (dev, staging, production).
Problem 4: Debugging Without History
A user reports an issue with a specific response. You want to know: what exact prompt generated this response? But there’s no way to look it up.
Version control solves all of these problems.
Git-Based Prompt Management
The simplest and best approach is to treat prompts like code and put them in Git:
Directory Structure
project/
├── prompts/
│   ├── customer-support/
│   │   ├── system-prompt.txt
│   │   └── .prompt-metadata.json
│   ├── content-generation/
│   │   ├── system-prompt.txt
│   │   └── .prompt-metadata.json
│   └── data-extraction/
│       ├── system-prompt.txt
│       └── .prompt-metadata.json
├── tests/
│   ├── test_customer_support_prompt.py
│   └── test_extraction_prompt.py
├── .git/
└── README.md
Metadata File
Each prompt should have a metadata file tracking its purpose and configuration:
{
  "name": "customer-support-system",
  "version": "1.3.2",
  "description": "System prompt for customer support chatbot",
  "created": "2024-03-01T10:00:00Z",
  "updated": "2024-03-19T14:30:00Z",
  "author": "sarah.chen@company.com",
  "model": "claude-3-sonnet",
  "context_window_used": 8000,
  "estimated_cost_per_1k": 0.003,
  "performance_metrics": {
    "accuracy": 0.94,
    "user_satisfaction": 4.7,
    "average_response_time_ms": 1200
  },
  "change_notes": "Updated tone to be more empathetic, added fallback for out-of-scope questions",
  "tags": ["production", "active", "customer-facing"],
  "previous_version": "1.3.1",
  "test_status": "passed"
}
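Application code can read this metadata directly. Here is a minimal loader sketch (`load_prompt_metadata` is a hypothetical helper, not part of any library) that also checks the version string is well-formed before anything downstream relies on it:

```python
import json
import re

def load_prompt_metadata(path: str) -> dict:
    """Load a prompt metadata file and sanity-check its version field."""
    with open(path) as f:
        metadata = json.load(f)
    # Semantic versions look like MAJOR.MINOR.PATCH, optionally with a
    # pre-release suffix such as 1.4.0-rc1.
    version = metadata.get("version", "")
    if not re.fullmatch(r"\d+\.\d+\.\d+(-[\w.]+)?", version):
        raise ValueError(f"Malformed version in {path}: {version!r}")
    return metadata
```

Catching a malformed version at load time is cheaper than discovering it later when a deployment record points at a version that never existed.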
Git Commit Workflow
Each prompt change is a Git commit with semantic versioning:
# Example commits
# Bug fix (patch version)
git commit -m "fix: correct tone in escalation messages (1.3.1 -> 1.3.2)"
# New feature (minor version)
git commit -m "feat: add product recommendation capability (1.3.2 -> 1.4.0)"
# Breaking change (major version)
git commit -m "BREAKING: restructure output format for API compatibility (1.4.0 -> 2.0.0)"
Each commit should include:
- The changed prompt
- Updated metadata
- Test results
- Performance impact (if known)
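The version arithmetic in those commit messages can be automated rather than done by hand. A small sketch (`bump_version` is a hypothetical helper name, not a standard tool):

```python
def bump_version(version: str, change_type: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string for a prompt change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change_type == "major":   # breaking change
        return f"{major + 1}.0.0"
    if change_type == "minor":   # new capability, backward compatible
        return f"{major}.{minor + 1}.0"
    if change_type == "patch":   # fix or small tweak
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change_type}")
```

Wiring this into the commit workflow keeps version bumps consistent: the change type in the commit message and the number in the metadata file can never drift apart.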
import subprocess
import json
from datetime import datetime

def log_prompt_change(prompt_id: str, new_version: str, change_description: str):
    """Update a prompt's metadata and commit the change to git"""
    # Update metadata
    metadata_file = f"prompts/{prompt_id}/.prompt-metadata.json"
    with open(metadata_file, 'r') as f:
        metadata = json.load(f)
    old_version = metadata['version']
    metadata['version'] = new_version
    metadata['updated'] = datetime.now().isoformat()
    metadata['change_notes'] = change_description
    metadata['previous_version'] = old_version  # matches the metadata schema
    with open(metadata_file, 'w') as f:
        json.dump(metadata, f, indent=2)
    # Commit to git; check=True raises if either command fails
    prompt_file = f"prompts/{prompt_id}/system-prompt.txt"
    subprocess.run(['git', 'add', prompt_file, metadata_file], check=True)
    message = f"prompt({prompt_id}): {old_version} -> {new_version}: {change_description}"
    subprocess.run(['git', 'commit', '-m', message], check=True)
    print(f"Logged change: {message}")
Prompt Registries and Catalogs
As you accumulate more prompts, you need a centralized registry to track what exists and where it’s deployed:
Simple Registry Implementation
import json
import os
from datetime import datetime
from typing import Dict, List, Optional

class PromptRegistry:
    """Central registry of all prompts"""

    def __init__(self, registry_file: str = "prompts/registry.json"):
        self.registry_file = registry_file
        self.prompts: Dict = self._load_registry()

    def _load_registry(self) -> Dict:
        """Load registry from disk"""
        if os.path.exists(self.registry_file):
            with open(self.registry_file, 'r') as f:
                return json.load(f)
        return {}

    def register(self, prompt_id: str, metadata: dict):
        """Register a new prompt or update an existing one"""
        self.prompts[prompt_id] = {
            **metadata,
            'registered_at': datetime.now().isoformat()
        }
        self._save_registry()

    def get(self, prompt_id: str) -> Optional[dict]:
        """Retrieve prompt metadata"""
        return self.prompts.get(prompt_id)

    def list_all(self) -> List[str]:
        """List all registered prompt IDs"""
        return list(self.prompts.keys())

    def list_by_environment(self, environment: str) -> List[str]:
        """Find all prompts deployed to an environment (dev/staging/prod)"""
        return [
            prompt_id
            for prompt_id, metadata in self.prompts.items()
            if environment in metadata.get('deployments', {})
        ]

    def deploy(self, prompt_id: str, version: str, environment: str):
        """Record that a specific version is deployed to an environment"""
        if prompt_id not in self.prompts:
            raise ValueError(f"Unknown prompt: {prompt_id}")
        deployments = self.prompts[prompt_id].setdefault('deployments', {})
        deployments[environment] = {
            'version': version,
            'deployed_at': datetime.now().isoformat()
        }
        self._save_registry()

    def get_deployed_version(self, prompt_id: str, environment: str) -> Optional[str]:
        """Get the currently deployed version for an environment"""
        deployments = self.prompts[prompt_id].get('deployments', {})
        if environment in deployments:
            return deployments[environment]['version']
        return None

    def _save_registry(self):
        """Save registry to disk"""
        with open(self.registry_file, 'w') as f:
            json.dump(self.prompts, f, indent=2)
# Usage
registry = PromptRegistry()
registry.register('customer-support', {
    'name': 'Customer Support',
    'current_version': '1.3.2',
    'model': 'claude-3-sonnet',
    'description': 'Handles customer inquiries',
    'maintainer': 'support-team@company.com'
})

registry.deploy('customer-support', '1.3.2', 'production')
registry.deploy('customer-support', '1.4.0-rc1', 'staging')

# Query
prod_version = registry.get_deployed_version('customer-support', 'production')
print(f"Production running: v{prod_version}")

staging_prompts = registry.list_by_environment('staging')
print(f"Staging has {len(staging_prompts)} prompts")
CI/CD for Prompts
Your prompt changes should go through the same testing rigor as code changes:
GitHub Actions Example
# .github/workflows/test-prompts.yml
name: Test Prompt Changes

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Detect changed prompts
        id: changes
        run: |
          git fetch origin main
          echo "changed_prompts=$(git diff --name-only origin/main HEAD | grep prompts/ | cut -d'/' -f2 | sort -u)" >> $GITHUB_OUTPUT

      - name: Run unit tests
        run: pytest tests/ -v

      - name: Run prompt validation
        run: python scripts/validate_prompts.py

      - name: Test against models
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python tests/test_prompts.py

      - name: Performance regression test
        run: python tests/test_performance.py
        continue-on-error: true  # Warn but don't fail

      - name: Generate report
        if: always()
        run: python scripts/generate_test_report.py
Validation Script
# scripts/validate_prompts.py
import os
import json
import sys

def validate_prompt_structure(prompt_path: str) -> bool:
    """Validate that a prompt directory has the correct structure"""
    errors = []

    # Check required files
    required_files = ['system-prompt.txt', '.prompt-metadata.json']
    for required_file in required_files:
        full_path = os.path.join(prompt_path, required_file)
        if not os.path.exists(full_path):
            errors.append(f"Missing {required_file}")

    # Check metadata validity (only if the file exists)
    metadata_path = os.path.join(prompt_path, '.prompt-metadata.json')
    if os.path.exists(metadata_path):
        try:
            with open(metadata_path, 'r') as f:
                metadata = json.load(f)
            required_fields = ['name', 'version', 'model', 'description']
            for field in required_fields:
                if field not in metadata:
                    errors.append(f"Missing metadata field: {field}")
        except json.JSONDecodeError:
            errors.append("Invalid JSON in .prompt-metadata.json")

    # Check prompt length (only if the file exists)
    system_prompt_path = os.path.join(prompt_path, 'system-prompt.txt')
    if os.path.exists(system_prompt_path):
        with open(system_prompt_path, 'r') as f:
            prompt_text = f.read()
        if len(prompt_text) < 50:
            errors.append("System prompt is too short (< 50 chars)")
        if len(prompt_text) > 50000:
            errors.append("System prompt is too long (> 50k chars)")

    if errors:
        print(f"Validation failed for {prompt_path}:")
        for error in errors:
            print(f"  - {error}")
        return False
    return True

# Validate all prompts
all_valid = True
for prompt_dir in os.listdir('prompts'):
    prompt_path = os.path.join('prompts', prompt_dir)
    if os.path.isdir(prompt_path):
        if not validate_prompt_structure(prompt_path):
            all_valid = False

sys.exit(0 if all_valid else 1)
Monitoring Production Performance
Once deployed, prompts need monitoring to catch performance regressions:
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class PromptMetrics:
    """Track performance metrics for a deployed prompt"""
    prompt_id: str
    version: str
    environment: str
    timestamp: datetime
    latency_ms: float
    tokens_generated: int
    user_satisfaction: float
    error: bool = False

class PromptMonitor:
    """Monitor prompt performance in production"""

    def __init__(self):
        self.metrics: List[PromptMetrics] = []

    def record_call(self,
                    prompt_id: str,
                    version: str,
                    environment: str,
                    latency_ms: float,
                    tokens: int,
                    satisfaction: Optional[float] = None,
                    error: bool = False):
        """Record a prompt execution"""
        metric = PromptMetrics(
            prompt_id=prompt_id,
            version=version,
            environment=environment,
            timestamp=datetime.now(),
            latency_ms=latency_ms,
            tokens_generated=tokens,
            user_satisfaction=satisfaction or 0.0,
            error=error
        )
        self.metrics.append(metric)

    def get_stats(self, prompt_id: str, hours: int = 1) -> Optional[Dict]:
        """Get statistics for the last N hours"""
        cutoff_time = datetime.now() - timedelta(hours=hours)
        relevant = [
            m for m in self.metrics
            if m.prompt_id == prompt_id and m.timestamp > cutoff_time
        ]
        if not relevant:
            return None
        latencies = [m.latency_ms for m in relevant]
        satisfactions = [m.user_satisfaction for m in relevant if m.user_satisfaction > 0]
        error_count = sum(1 for m in relevant if m.error)
        return {
            'total_calls': len(relevant),
            'avg_latency_ms': sum(latencies) / len(latencies),
            # Nearest-rank approximation of the 95th percentile
            'p95_latency_ms': sorted(latencies)[int(0.95 * len(latencies))],
            'avg_satisfaction': sum(satisfactions) / len(satisfactions) if satisfactions else None,
            'error_rate': error_count / len(relevant),
            'total_tokens': sum(m.tokens_generated for m in relevant)
        }

    def detect_regression(self, prompt_id: str, baseline_hours: int = 24) -> bool:
        """Detect whether current performance has degraded from the baseline"""
        current_stats = self.get_stats(prompt_id, hours=1)
        baseline_stats = self.get_stats(prompt_id, hours=baseline_hours)
        if not current_stats or not baseline_stats:
            return False
        # Check for a significant error rate increase
        if current_stats['error_rate'] > baseline_stats['error_rate'] * 1.5:
            print(f"WARNING: Error rate increased for {prompt_id}")
            return True
        # Check for a significant satisfaction drop
        if current_stats['avg_satisfaction'] and baseline_stats['avg_satisfaction']:
            if current_stats['avg_satisfaction'] < baseline_stats['avg_satisfaction'] * 0.9:
                print(f"WARNING: Satisfaction dropped for {prompt_id}")
                return True
        return False
# Usage
monitor = PromptMonitor()

# Simulate production calls
for i in range(100):
    monitor.record_call(
        prompt_id='customer-support',
        version='1.3.2',
        environment='production',
        latency_ms=1250,
        tokens=500,
        satisfaction=4.6,
        error=False
    )

# Check stats
stats = monitor.get_stats('customer-support', hours=1)
print(f"Stats: {stats}")

# Check for regression
is_degraded = monitor.detect_regression('customer-support')
if is_degraded:
    print("Alert: Performance degraded, consider rollback")
Deployment Strategies
Blue-Green Deployment
Run old and new prompts in parallel before fully switching:
import hashlib

class BlueGreenPromptDispatcher:
    """Route requests to old (blue) or new (green) prompt versions"""

    def __init__(self, prompt_id: str, blue_version: str, green_version: str):
        self.prompt_id = prompt_id
        self.blue_version = blue_version
        self.green_version = green_version
        self.green_traffic_percent = 0  # Start with 0% to the new version

    def route_request(self, request_id: str) -> str:
        """Decide which version to use for this request"""
        # Hash the request_id so routing is deterministic per request
        # (MD5 is fine here: this is bucketing, not cryptography)
        hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
        bucket = hash_val % 100
        if bucket < self.green_traffic_percent:
            return self.green_version
        return self.blue_version

    def increase_green_traffic(self, percent: int):
        """Gradually increase traffic to the new version"""
        self.green_traffic_percent = min(100, percent)
        print(f"Routing {self.green_traffic_percent}% traffic to green ({self.green_version})")

    def rollback(self):
        """Roll back to the blue version"""
        self.green_traffic_percent = 0
        print(f"Rolled back to blue ({self.blue_version})")

# Deployment plan
dispatcher = BlueGreenPromptDispatcher(
    prompt_id='customer-support',
    blue_version='1.3.2',   # Current production
    green_version='1.4.0'   # New version
)

# Gradually shift traffic
dispatcher.increase_green_traffic(5)    # 5% to new version
# Monitor for errors...
dispatcher.increase_green_traffic(25)   # 25% to new version
# Monitor...
dispatcher.increase_green_traffic(50)   # 50% to new version
# Full monitoring...
dispatcher.increase_green_traffic(100)  # All traffic to new version

# If problems appear:
# dispatcher.rollback()
Canary Deployment
Deploy to a small subset of users first:
import hashlib

def is_canary_user(user_id: str, canary_percent: int = 5) -> bool:
    """Determine whether a user is in the canary group"""
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return (hash_val % 100) < canary_percent

def get_prompt_version(user_id: str,
                       stable_version: str,
                       canary_version: str,
                       canary_percent: int = 5) -> str:
    """Route a user to the stable or canary prompt version"""
    if is_canary_user(user_id, canary_percent):
        return canary_version
    return stable_version

# Usage
user_prompt_version = get_prompt_version(
    user_id='user_12345',
    stable_version='1.3.2',
    canary_version='1.4.0-rc1',
    canary_percent=10
)
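One property worth noting: because routing hashes the user_id, the same user always lands in the same group, so nobody flips between stable and canary versions mid-session. A quick self-contained check (restating is_canary_user so the snippet runs on its own):

```python
import hashlib

def is_canary_user(user_id: str, canary_percent: int = 5) -> bool:
    """Same hash-bucket routing as above."""
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return (hash_val % 100) < canary_percent

# Deterministic: repeated calls for the same user always agree
assert all(
    is_canary_user(f"user_{i}") == is_canary_user(f"user_{i}")
    for i in range(1000)
)

# The observed canary fraction tracks the configured percentage
canary = sum(is_canary_user(f"user_{i}", canary_percent=5) for i in range(10_000))
print(f"{canary / 100:.1f}% of users in canary")  # roughly 5%
```

The same argument applies to the blue-green dispatcher above: hashing the request_id gives per-request consistency without storing any routing state.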
Exercise: Set Up Prompt Versioning
Implement a complete Git-based prompt versioning system:
- Create a directory structure with at least 2 prompts
- Create metadata JSON files for each
- Set up a prompt registry that tracks versions and deployments
- Create a simple Python script that:
- Validates all prompts have correct structure
- Updates version numbers when prompts change
- Logs changes to git with semantic versioning
- Create test cases that validate prompt quality
- Set up a GitHub Actions workflow (or equivalent) that:
- Validates prompts on PR
- Runs test suite
- Prevents merging if tests fail
Deliverables:
- A GitHub repo with prompt versioning structure
- At least 2 production-ready prompts with metadata
- A working registry showing current versions
- A validation script
- A CI/CD workflow file
- Documentation on how to deploy/rollback
Summary
In this lesson, you’ve learned:
- Why prompt versioning is critical for production systems
- Git-based workflows for managing prompts like code
- Semantic versioning for prompt changes
- Building a central prompt registry and catalog
- Setting up CI/CD pipelines for prompt testing
- Monitoring production performance to detect regressions
- Deployment strategies like blue-green and canary
- How to roll back safely when problems occur
Module 1 is complete. You now understand how to measure, test, optimize, and manage prompts systematically. Next, we move into Module 2: System Prompts and Behavioral Design.