Systematic Prompt Testing and A/B Testing
Introduction
You now understand what makes a good prompt. The next question is: how do you actually compare two prompts scientifically? Intuition isn’t good enough. You could try both prompts, get a few examples back, and pick whichever feels better. But that’s how you end up deploying a prompt that works on your cherry-picked examples but fails in production.
This lesson teaches you how to build test harnesses, create representative test sets, run A/B tests, and determine statistical significance. By the end, you’ll have a systematic way to answer the question: “Should I use prompt A or prompt B?”
Key Takeaway: Good prompts aren’t discovered by accident. They’re tested rigorously against holdout test sets, compared with proper statistical methods, and deployed only after proving they work better than alternatives.
The Problem with Casual Testing
Let’s start with what NOT to do:
# BAD: Casual testing without rigor
prompt_a = "Summarize this in one sentence: {text}"
prompt_b = "Create a concise one-sentence summary: {text}"
text = "The company reported record earnings today..."
response_a = model.generate(prompt_a.format(text=text))
response_b = model.generate(prompt_b.format(text=text))
if len(response_a) < len(response_b):
    print("Prompt A is better because it's shorter!")
Why is this bad?
- One example is meaningless: Maybe prompt A just happened to get lucky on this one example
- No consistent metric: “Shorter” might not be what you actually care about
- No statistical rigor: You can’t claim one is better with N=1
- Sampling bias: You picked one example, probably without thinking about whether it’s representative
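To see how easily a single example misleads, simulate two prompts of nearly identical true quality and compare the verdict from one example against the verdict from fifty. The scores below are synthetic Gaussian draws, not real model outputs:

```python
import random

# Synthetic per-example quality scores for two prompts of nearly equal quality
rng = random.Random(42)
scores_a = [rng.gauss(0.80, 0.10) for _ in range(50)]
scores_b = [rng.gauss(0.78, 0.10) for _ in range(50)]

# Verdict from the first example alone vs. the average over all fifty
single_verdict = "A" if scores_a[0] > scores_b[0] else "B"
average_verdict = "A" if sum(scores_a) / 50 > sum(scores_b) / 50 else "B"

print(f"first example says: {single_verdict}, average of 50 says: {average_verdict}")
```

With per-example noise this large, the single-example verdict flips from run to run, while the 50-example average is far more stable. That gap between one sample and many is exactly what the rest of this lesson addresses.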
Here’s a better approach:
Building a Test Harness
A test harness automates the process of running prompts against test cases and collecting metrics:
from dataclasses import dataclass
from typing import List, Callable, Dict
import time


@dataclass
class TestCase:
    """A single test case with input and expected output"""
    test_id: str
    input: str
    expected_output: str
    category: str = "general"  # Optional: for stratified analysis


@dataclass
class TestResult:
    """Result of running one prompt on one test case"""
    test_id: str
    prompt_name: str
    output: str
    latency: float
    metrics: Dict[str, float]


class PromptTestHarness:
    """Systematically test prompts against a fixed test set"""

    def __init__(self, test_cases: List[TestCase]):
        self.test_cases = test_cases
        self.results = []

    def run_prompt(self,
                   prompt_template: str,
                   model_fn: Callable,
                   prompt_name: str) -> List[TestResult]:
        """Run a prompt against all test cases"""
        results = []
        for test_case in self.test_cases:
            # Prepare the prompt
            prompt = prompt_template.format(input=test_case.input)

            # Measure latency
            start_time = time.time()
            output = model_fn(prompt)
            latency = time.time() - start_time

            # Compute metrics
            metrics = {
                'exact_match': int(output.strip() == test_case.expected_output.strip()),
                'contains_key_words': self._contains_key_words(
                    output, test_case.expected_output
                ),
                'length_ratio': len(output.split()) / max(1, len(test_case.expected_output.split()))
            }

            results.append(TestResult(
                test_id=test_case.test_id,
                prompt_name=prompt_name,
                output=output,
                latency=latency,
                metrics=metrics
            ))

        self.results.extend(results)
        return results

    def _contains_key_words(self, output: str, expected: str) -> float:
        """Simple metric: fraction of expected words present in the output"""
        expected_words = set(expected.lower().split())
        output_words = set(output.lower().split())
        if not expected_words:
            return 1.0
        overlap = len(expected_words & output_words)
        return overlap / len(expected_words)

    def compare_prompts(self,
                        prompt_a: str,
                        prompt_b: str,
                        model_fn: Callable,
                        metric: str = 'exact_match') -> Dict:
        """Run an A/B test comparing two prompt templates"""
        results_a = self.run_prompt(prompt_a, model_fn, "Prompt A")
        results_b = self.run_prompt(prompt_b, model_fn, "Prompt B")

        # Extract metrics for comparison
        scores_a = [r.metrics[metric] for r in results_a]
        scores_b = [r.metrics[metric] for r in results_b]

        return {
            'prompt_a_avg': sum(scores_a) / len(scores_a),
            'prompt_b_avg': sum(scores_b) / len(scores_b),
            'prompt_a_results': results_a,
            'prompt_b_results': results_b,
            'difference': sum(scores_b) / len(scores_b) - sum(scores_a) / len(scores_a)
        }
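To make the flow concrete, here is the same run-and-score loop in miniature, with a hypothetical stub standing in for the model so the example runs offline (`fake_model` and its canned answers are invented for illustration):

```python
# Three (input, expected) pairs; the stub "model" below is a stand-in, not a real API
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

def fake_model(prompt: str) -> str:
    # Pretend the model answers tersely when asked "briefly", verbosely otherwise
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    for question, answer in answers.items():
        if question in prompt:
            return answer if "briefly" in prompt else answer + " (let me explain at length)"
    return "unknown"

def avg_exact_match(template: str) -> float:
    # The harness's run-and-score loop, reduced to one line per case
    scores = [int(fake_model(template.format(input=q)).strip() == a) for q, a in cases]
    return sum(scores) / len(scores)

avg_a = avg_exact_match("Answer briefly: {input}")
avg_b = avg_exact_match("Answer the question: {input}")
print(f"A: {avg_a:.2f}  B: {avg_b:.2f}")  # A: 1.00  B: 0.00
```

The stub makes the point that the harness itself is model-agnostic: anything callable as `model_fn(prompt) -> str` plugs in, which also makes the harness easy to unit-test.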
Creating Representative Test Datasets
Your test set is crucial. It determines whether your evaluation reflects real-world performance. Here’s how to build one:
1. Stratified Sampling
Make sure your test set covers different categories of inputs:
import random

def build_stratified_test_set(examples: List[Dict],
                              category_field: str = 'category',
                              examples_per_category: int = 10) -> List[TestCase]:
    """Build a test set with balanced representation across categories"""
    # Group examples by category
    by_category = {}
    for example in examples:
        cat = example.get(category_field, 'unknown')
        by_category.setdefault(cat, []).append(example)

    # Randomly sample from each category; taking the first N would bias
    # the test set toward whatever order the data arrived in
    test_cases = []
    test_id = 0
    for category, category_examples in by_category.items():
        sampled = random.sample(category_examples,
                                min(examples_per_category, len(category_examples)))
        for example in sampled:
            test_cases.append(TestCase(
                test_id=f"test_{test_id:04d}",
                input=example['input'],
                expected_output=example['expected'],
                category=category
            ))
            test_id += 1
    return test_cases

# Example usage
all_examples = [
    {'input': 'What is 2+2?', 'expected': '4', 'category': 'math'},
    {'input': 'Explain photosynthesis', 'expected': 'Plants use...', 'category': 'science'},
    {'input': 'Who won the 2020 election?', 'expected': 'Joe Biden', 'category': 'trivia'},
    # ... many more examples
]

test_set = build_stratified_test_set(all_examples, examples_per_category=5)
print(f"Test set: {len(test_set)} cases across {len(set(t.category for t in test_set))} categories")
2. Avoid Contamination
Never use examples from your training/prompt-development data in your test set:
def split_data(all_examples, test_fraction=0.2, seed=42):
    """Split into train (for prompt development) and test (for evaluation)"""
    import random
    rng = random.Random(seed)  # Fixed seed keeps the split reproducible
    shuffled = all_examples[:]
    rng.shuffle(shuffled)
    split_point = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:split_point], shuffled[split_point:]

# Good practice
train_examples, test_examples = split_data(all_examples, test_fraction=0.2)

# Develop prompts using train_examples
prompt_a = "Based on the pattern in these examples..."
prompt_b = "Let me analyze this input..."

# Evaluate on test_examples only
harness = PromptTestHarness(
    [TestCase(f"test_{i}", ex['input'], ex['expected'])
     for i, ex in enumerate(test_examples)]
)
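Shuffle-splitting prevents direct reuse, but it's worth verifying that no test input also appears in the training split, e.g. from duplicates in the raw data. A minimal verbatim-overlap check (the example inputs are made up):

```python
def find_contamination(train, test):
    """Return test inputs that also appear (case-insensitively) in the train split."""
    train_inputs = {ex['input'].strip().lower() for ex in train}
    return [ex['input'] for ex in test
            if ex['input'].strip().lower() in train_inputs]

train = [{'input': 'What is 2+2?'}, {'input': 'Explain gravity'}]
test = [{'input': 'what is 2+2?'}, {'input': 'Define entropy'}]

leaks = find_contamination(train, test)
print(leaks)  # ['what is 2+2?']
```

Exact matching is only a first pass: near-duplicate detection (edit distance, embedding similarity) catches subtler leakage that a verbatim check misses.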
3. Edge Cases and Adversarial Examples
Your test set should include hard cases, not just easy ones:
def add_edge_cases(test_cases: List[TestCase]) -> List[TestCase]:
    """Include hard cases that might break prompts"""
    edge_cases = [
        # Empty/minimal inputs
        TestCase("edge_empty", "", "N/A", "edge"),
        # Extremely long inputs
        TestCase("edge_long", "A" * 5000, "Should handle long input", "edge"),
        # Ambiguous/conflicting inputs
        TestCase("edge_ambiguous",
                 "Is 50 degrees hot or cold?",
                 "It depends on context",
                 "edge"),
        # Inputs similar to training data but slightly different
        TestCase("edge_distribution_shift",
                 "The cat sat on the mat... the dog played with the ball",
                 "Should handle distribution shift",
                 "edge"),
        # Adversarial/tricky inputs
        TestCase("edge_adversarial",
                 "Count the number of 'e's in 'enterprise'",
                 "3",
                 "edge"),
    ]
    return test_cases + edge_cases
Running A/B Tests with Statistical Rigor
Once you have a test set, you need proper statistical methods to compare prompts. Here’s why casual comparison fails:
If prompt A scores 8/10 and prompt B scores 7/10, is A really better? Maybe B just got unlucky on this particular test set. To make a defensible claim, you need a significance test.
The Paired T-Test Approach
When you have the same test cases for both prompts, use a paired t-test:
from scipy import stats
import numpy as np

def paired_ttest_comparison(results_a: List[float],
                            results_b: List[float],
                            alpha: float = 0.05) -> Dict:
    """Compare two prompts with a paired t-test"""
    # A paired test requires the same test cases in the same order
    assert len(results_a) == len(results_b), "Results must have same length"

    mean_a = np.mean(results_a)
    mean_b = np.mean(results_b)
    difference = mean_b - mean_a

    # Two-sided paired t-test on the per-case score differences
    t_statistic, p_value = stats.ttest_rel(results_b, results_a)
    is_significant = p_value < alpha

    # Interpret: a significant result only favors B if the difference is positive
    if is_significant:
        better = "B" if difference > 0 else "A"
        interpretation = f"Prompt {better} is statistically significantly better (p={p_value:.4f})"
    else:
        interpretation = f"No statistically significant difference (p={p_value:.4f})"

    return {
        'prompt_a_mean': mean_a,
        'prompt_b_mean': mean_b,
        'difference': difference,
        'p_value': p_value,
        'is_significant': is_significant,
        'interpretation': interpretation
    }

# Example
scores_prompt_a = [0.9, 0.8, 0.7, 0.85, 0.92, 0.88, 0.91, 0.79, 0.84, 0.86]
scores_prompt_b = [0.92, 0.85, 0.74, 0.87, 0.94, 0.89, 0.93, 0.82, 0.86, 0.88]
comparison = paired_ttest_comparison(scores_prompt_a, scores_prompt_b)
print(comparison['interpretation'])
Understanding Statistical Significance
The p-value answers the question: “If these two prompts were actually identical in quality, what’s the probability of seeing a difference at least this large by chance?”
- p < 0.05: The difference is statistically significant (under a 5% chance of arising from random variation alone). The evidence favors a real difference.
- p >= 0.05: The difference could plausibly be random variation. You can’t confidently say one is better — which is not the same as showing they’re equal.
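One caveat: the t-test assumes the per-case score differences are roughly normal, which is questionable for 0/1 metrics like exact match. A paired bootstrap makes no such assumption and needs only the standard library; this sketch reuses the illustrative scores from the example above:

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resampled mean differences that are <= 0 — an estimate of
    how often B's observed advantage over A could arise from noise alone."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample the per-case differences with replacement
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0:
            not_better += 1
    return not_better / n_resamples

a = [0.9, 0.8, 0.7, 0.85, 0.92, 0.88, 0.91, 0.79, 0.84, 0.86]
b = [0.92, 0.85, 0.74, 0.87, 0.94, 0.89, 0.93, 0.82, 0.86, 0.88]
print(f"bootstrap p = {paired_bootstrap_pvalue(a, b):.4f}")
```

In this toy data every per-case difference favors B, so the bootstrap p-value comes out at zero; with noisier real scores it behaves like a conventional p-value.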
Minimum Sample Size
How many test cases do you need? It depends on how large a difference you want to detect: a few dozen cases can reveal large gaps, but small improvements of a few percentage points typically require hundreds of cases:
def estimate_required_sample_size(effect_size: float = 0.2,
                                  alpha: float = 0.05,
                                  power: float = 0.8) -> int:
    """Estimate minimum sample size per prompt (normal approximation)"""
    from scipy.stats import norm
    # Simplified two-sample power calculation; use statsmodels'
    # TTestPower for a more accurate answer
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = (2 * (z_alpha + z_beta) ** 2) / (effect_size ** 2)
    return int(np.ceil(n))

sample_size = estimate_required_sample_size(effect_size=0.2)
print(f"Need at least {sample_size} test cases per prompt")
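As a cross-check, the same normal-approximation arithmetic can be done by hand. For a paired design (as in the paired t-test above) a standard formula is n = (z_{α/2} + z_β)² / d², where d is the mean per-case difference in units of its standard deviation; the z-values below are the usual 1.96 (α = 0.05, two-sided) and 0.84 (power = 0.8):

```python
import math

def paired_required_n(effect_size: float,
                      z_alpha: float = 1.96,
                      z_beta: float = 0.84) -> int:
    # n = (z_alpha + z_beta)^2 / d^2 for a paired (one-sample-on-differences) design
    return math.ceil(((z_alpha + z_beta) ** 2) / effect_size ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: need ~{paired_required_n(d)} paired test cases")
```

The takeaway is the same either way: detecting small improvements takes hundreds of cases, while only large gaps show up reliably with a few dozen.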
Implementing a Complete Testing Workflow
Here’s a complete, end-to-end example:
import json
from datetime import datetime
from pathlib import Path

class PromptTestingFramework:
    """End-to-end prompt testing with reporting"""

    def __init__(self, test_set: List[TestCase], output_dir: str = "./test_results"):
        self.test_set = test_set
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results = {}

    def test_prompt(self,
                    prompt_name: str,
                    prompt_template: str,
                    model_fn: Callable,
                    metric_functions: Dict[str, Callable]) -> Dict:
        """Test a single prompt and compute all metrics"""
        results = []
        test_timestamp = datetime.now().isoformat()

        for test_case in self.test_set:
            prompt = prompt_template.format(input=test_case.input)

            # Generate output
            start = time.time()
            output = model_fn(prompt)
            latency = time.time() - start

            # Compute all metrics; record failures instead of aborting the run
            metrics = {}
            for metric_name, metric_fn in metric_functions.items():
                try:
                    metrics[metric_name] = metric_fn(output, test_case.expected_output)
                except Exception as e:
                    metrics[metric_name] = f"error: {e}"

            results.append({
                'test_id': test_case.test_id,
                'category': test_case.category,
                'output': output,
                'expected': test_case.expected_output,
                'latency': latency,
                'metrics': metrics
            })

        # Summary statistics
        summary = {
            'prompt_name': prompt_name,
            'timestamp': test_timestamp,
            'num_tests': len(results),
            'avg_latency': float(np.mean([r['latency'] for r in results])),
            'metrics_summary': {}
        }
        for metric_name in metric_functions:
            scores = [r['metrics'][metric_name] for r in results
                      if isinstance(r['metrics'][metric_name], (int, float))]
            if scores:
                summary['metrics_summary'][metric_name] = {
                    'mean': float(np.mean(scores)),
                    'std': float(np.std(scores)),
                    'min': min(scores),
                    'max': max(scores)
                }

        self.results[prompt_name] = {
            'results': results,
            'summary': summary
        }
        return summary

    def compare_prompts(self, prompt_a: str, prompt_b: str,
                        metric: str = 'accuracy') -> Dict:
        """Compare two previously tested prompts with statistical testing"""
        if prompt_a not in self.results or prompt_b not in self.results:
            raise ValueError("Both prompts must be tested first")

        results_a = self.results[prompt_a]['results']
        results_b = self.results[prompt_b]['results']

        # Get numeric scores for the chosen metric
        scores_a = [r['metrics'][metric] for r in results_a
                    if isinstance(r['metrics'][metric], (int, float))]
        scores_b = [r['metrics'][metric] for r in results_b
                    if isinstance(r['metrics'][metric], (int, float))]

        # Statistical test
        comparison = paired_ttest_comparison(scores_a, scores_b)

        # Per-category analysis: index results by test_id for direct lookup
        results_a_by_id = {r['test_id']: r for r in results_a}
        results_b_by_id = {r['test_id']: r for r in results_b}
        category_analysis = {}
        for test_case in self.test_set:
            cat = test_case.category
            category_analysis.setdefault(cat, {'a': [], 'b': []})
            for side, by_id in (('a', results_a_by_id), ('b', results_b_by_id)):
                r = by_id.get(test_case.test_id)
                if r and isinstance(r['metrics'][metric], (int, float)):
                    category_analysis[cat][side].append(r['metrics'][metric])

        category_summary = {}
        for cat, scores in category_analysis.items():
            if scores['a'] and scores['b']:
                category_summary[cat] = {
                    'a_mean': float(np.mean(scores['a'])),
                    'b_mean': float(np.mean(scores['b'])),
                    'b_better': np.mean(scores['b']) > np.mean(scores['a'])
                }

        return {
            'overall_comparison': comparison,
            'category_analysis': category_summary
        }

    def generate_report(self, output_file: str = "test_report.json"):
        """Generate JSON report of all tests"""
        report = {
            'timestamp': datetime.now().isoformat(),
            'test_set_size': len(self.test_set),
            'prompts_tested': list(self.results.keys()),
            'summaries': {name: data['summary'] for name, data in self.results.items()}
        }
        output_path = self.output_dir / output_file
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2, default=float)
        print(f"Report saved to {output_path}")
        return report
# Usage example
def exact_match(output, expected):
    return int(output.strip() == expected.strip())

def contains_key_info(output, expected):
    words = set(expected.lower().split())
    if not words:
        return 1.0
    return sum(1 for w in words if w in output.lower()) / len(words)

framework = PromptTestingFramework(test_set)

# Test prompt A
framework.test_prompt(
    "Prompt A",
    "Answer briefly: {input}",
    model_fn=my_model.generate,
    metric_functions={
        'accuracy': exact_match,
        'relevance': contains_key_info
    }
)

# Test prompt B
framework.test_prompt(
    "Prompt B",
    "Provide a concise answer: {input}",
    model_fn=my_model.generate,
    metric_functions={
        'accuracy': exact_match,
        'relevance': contains_key_info
    }
)

# Compare
comparison = framework.compare_prompts("Prompt A", "Prompt B")
print(f"Result: {comparison['overall_comparison']['interpretation']}")

# Generate report
framework.generate_report()
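Exact match is harsh on free-form answers ("Joe Biden" vs. "Joe Biden won"). A token-level F1, in the style of reading-comprehension benchmarks, credits partial overlap and can be dropped in as another metric_fn:

```python
from collections import Counter

def token_f1(output: str, expected: str) -> float:
    """Harmonic mean of token precision and recall against the expected answer."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    if not out_tokens or not exp_tokens:
        return float(out_tokens == exp_tokens)  # both empty counts as a match
    # Multiset intersection: counts shared tokens, respecting repetitions
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Joe Biden won the election", "Joe Biden"))  # ~0.571
```

Perfect recall with imperfect precision (extra words) yields a score between 0 and 1 instead of a flat failure, which makes per-prompt averages far less brittle for generative tasks.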
Using Existing Tools
You don’t have to build everything from scratch. These tools provide testing infrastructure:
PromptFoo
A CLI tool for declarative prompt testing. Prompts and test cases live in YAML files (the schema below is illustrative; check the PromptFoo docs for the current config format):
# prompts.yaml
- id: prompt-a
  template: "Answer briefly: {input}"
- id: prompt-b
  template: "Provide a concise answer: {input}"

# tests.yaml
- input: "What is 2+2?"
  expected: "4"
- input: "Explain gravity"
  expected: "Force that attracts objects with mass"

# Run the evaluation
promptfoo eval
LangSmith
LangChain’s evaluation and monitoring platform (API details vary by version; consult the LangSmith docs):
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Log a prompt run
client.create_run(
    name="prompt-test",
    run_type="llm",
    inputs={"question": "What is 2+2?"},
    outputs={"answer": "4"}
)

# Evaluate and track
results = evaluate(
    lambda x: model.invoke(prompt.format(input=x['input'])),
    data=dataset,
    evaluators=[accuracy_evaluator, relevance_evaluator],
    experiment_prefix="my-experiment"
)
Exercise: Build a Testing Pipeline
Build a complete A/B testing pipeline for a text classification task (classify reviews as positive/negative):
- Create 50 test cases (25 positive, 25 negative reviews)
- Write two different prompts for classification
- Run both prompts against the test set
- Implement accuracy and F1-score metrics
- Run a statistical test to determine if one is significantly better
- Generate a report showing:
- Overall accuracy for each prompt
- Per-category (positive vs negative) breakdown
- P-value and statistical significance
- Per-example failures (which reviews each prompt got wrong)
Starter code:
test_cases = [
    TestCase("test_001", "This product is amazing!", "positive"),
    TestCase("test_002", "Terrible quality, waste of money", "negative"),
    # ... 48 more
]

prompt_a = "Classify as positive or negative: {input}"
prompt_b = "Is this review positive or negative? {input}"

harness = PromptTestHarness(test_cases)
# ... continue with testing
Submit your code, the comparison results, and your interpretation of whether one prompt is statistically significantly better than the other.
Summary
In this lesson, you’ve learned:
- Why casual testing fails and what rigorous testing looks like
- How to build test harnesses that automate prompt evaluation
- How to create representative test sets with stratified sampling
- Statistical methods (paired t-tests) to determine significance
- How to analyze results by category to understand prompt strengths/weaknesses
- Tools like PromptFoo and LangSmith that automate testing
- A complete end-to-end testing framework you can adapt to your own tasks
Next, you’ll learn how different models respond differently to prompts, and how to optimize for specific models rather than just one generic approach.