Systematic Prompt Testing and A/B Testing
Introduction
You now understand what makes a good prompt. The next question is: how do you actually compare two prompts scientifically? Intuition isn’t good enough. You could try both prompts, get a few examples back, and pick whichever feels better. But that’s how you end up deploying a prompt that works on your cherry-picked examples but fails in production.
This lesson teaches you how to build test harnesses, create representative test sets, run A/B tests, and determine statistical significance. By the end, you’ll have a systematic way to answer the question: “Should I use prompt A or prompt B?”
Key Takeaway: Good prompts aren’t discovered by accident. They’re tested rigorously against holdout test sets, compared with proper statistical methods, and deployed only after proving they work better than alternatives.
The Problem with Casual Testing
Let’s start with what NOT to do:
# BAD: Casual testing without rigor
prompt_a = "Summarize this in one sentence: {text}"
prompt_b = "Create a concise one-sentence summary: {text}"
text = "The company reported record earnings today..."
response_a = model.generate(prompt_a.format(text=text))
response_b = model.generate(prompt_b.format(text=text))
if len(response_a) < len(response_b):
    print("Prompt A is better because it's shorter!")
Why is this bad?
- One example is meaningless: Maybe prompt A just happened to get lucky on this one example
- No consistent metric: “Shorter” might not be what you actually care about
- No statistical rigor: You can’t claim one is better with N=1
- Sampling bias: You picked one example, probably without thinking about whether it’s representative
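To see how easily a single example misleads, simulate two prompts of nearly identical true quality and compare the verdict from one example against the verdict from fifty. The scores below are synthetic Gaussian draws, not real model outputs:

```python
import random

# Synthetic per-example quality scores for two prompts of nearly equal quality
rng = random.Random(42)
scores_a = [rng.gauss(0.80, 0.10) for _ in range(50)]
scores_b = [rng.gauss(0.78, 0.10) for _ in range(50)]

# Verdict from the first example alone vs. the average over all fifty
single_verdict = "A" if scores_a[0] > scores_b[0] else "B"
average_verdict = "A" if sum(scores_a) / 50 > sum(scores_b) / 50 else "B"

print(f"first example says: {single_verdict}, average of 50 says: {average_verdict}")
```

With per-example noise this large, the single-example verdict flips from run to run, while the 50-example average is far more stable. That gap between one sample and many is exactly what the rest of this lesson addresses.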
Here’s a better approach:
Building a Test Harness
A test harness automates the process of running prompts against test cases and collecting metrics:
from dataclasses import dataclass
from typing import List, Callable, Dict
import time


@dataclass
class TestCase:
    """A single test case with input and expected output"""
    test_id: str
    input: str
    expected_output: str
    category: str = "general"  # Optional: for stratified analysis


@dataclass
class TestResult:
    """Result of running one prompt on one test case"""
    test_id: str
    prompt_name: str
    output: str
    latency: float
    metrics: Dict[str, float]


class PromptTestHarness:
    """Systematically test prompts against a fixed test set"""

    def __init__(self, test_cases: List[TestCase]):
        self.test_cases = test_cases
        self.results = []

    def run_prompt(self,
                   prompt_template: str,
                   model_fn: Callable,
                   prompt_name: str) -> List[TestResult]:
        """Run a prompt against all test cases"""
        results = []
        for test_case in self.test_cases:
            # Prepare the prompt
            prompt = prompt_template.format(input=test_case.input)

            # Measure latency
            start_time = time.time()
            output = model_fn(prompt)
            latency = time.time() - start_time

            # Compute metrics
            metrics = {
                'exact_match': int(output.strip() == test_case.expected_output.strip()),
                'contains_key_words': self._contains_key_words(
                    output, test_case.expected_output
                ),
                'length_ratio': len(output.split()) / max(1, len(test_case.expected_output.split()))
            }

            results.append(TestResult(
                test_id=test_case.test_id,
                prompt_name=prompt_name,
                output=output,
                latency=latency,
                metrics=metrics
            ))

        self.results.extend(results)
        return results

    def _contains_key_words(self, output: str, expected: str) -> float:
        """Simple metric: fraction of expected words present in the output"""
        expected_words = set(expected.lower().split())
        output_words = set(output.lower().split())
        if not expected_words:
            return 1.0
        overlap = len(expected_words & output_words)
        return overlap / len(expected_words)

    def compare_prompts(self,
                        prompt_a: str,
                        prompt_b: str,
                        model_fn: Callable,
                        metric: str = 'exact_match') -> Dict:
        """Run an A/B test comparing two prompt templates"""
        results_a = self.run_prompt(prompt_a, model_fn, "Prompt A")
        results_b = self.run_prompt(prompt_b, model_fn, "Prompt B")

        # Extract metrics for comparison
        scores_a = [r.metrics[metric] for r in results_a]
        scores_b = [r.metrics[metric] for r in results_b]

        return {
            'prompt_a_avg': sum(scores_a) / len(scores_a),
            'prompt_b_avg': sum(scores_b) / len(scores_b),
            'prompt_a_results': results_a,
            'prompt_b_results': results_b,
            'difference': sum(scores_b) / len(scores_b) - sum(scores_a) / len(scores_a)
        }
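To make the flow concrete, here is the same run-and-score loop in miniature, with a hypothetical stub standing in for the model so the example runs offline (`fake_model` and its canned answers are invented for illustration):

```python
# Three (input, expected) pairs; the stub "model" below is a stand-in, not a real API
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

def fake_model(prompt: str) -> str:
    # Pretend the model answers tersely when asked "briefly", verbosely otherwise
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    for question, answer in answers.items():
        if question in prompt:
            return answer if "briefly" in prompt else answer + " (let me explain at length)"
    return "unknown"

def avg_exact_match(template: str) -> float:
    # The harness's run-and-score loop, reduced to one line per case
    scores = [int(fake_model(template.format(input=q)).strip() == a) for q, a in cases]
    return sum(scores) / len(scores)

avg_a = avg_exact_match("Answer briefly: {input}")
avg_b = avg_exact_match("Answer the question: {input}")
print(f"A: {avg_a:.2f}  B: {avg_b:.2f}")  # A: 1.00  B: 0.00
```

The stub makes the point that the harness itself is model-agnostic: anything callable as `model_fn(prompt) -> str` plugs in, which also makes the harness easy to unit-test.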
Creating Representative Test Datasets
Your test set is crucial. It determines whether your evaluation reflects real-world performance. Here’s how to build one:
1. Stratified Sampling
Make sure your test set covers different categories of inputs:
import random

def build_stratified_test_set(examples: List[Dict],
                              category_field: str = 'category',
                              examples_per_category: int = 10) -> List[TestCase]:
    """Build a test set with balanced representation across categories"""
    # Group examples by category
    by_category = {}
    for example in examples:
        cat = example.get(category_field, 'unknown')
        by_category.setdefault(cat, []).append(example)

    # Randomly sample from each category; taking the first N would bias
    # the test set toward whatever order the data arrived in
    test_cases = []
    test_id = 0
    for category, category_examples in by_category.items():
        sampled = random.sample(category_examples,
                                min(examples_per_category, len(category_examples)))
        for example in sampled:
            test_cases.append(TestCase(
                test_id=f"test_{test_id:04d}",
                input=example['input'],
                expected_output=example['expected'],
                category=category
            ))
            test_id += 1
    return test_cases

# Example usage
all_examples = [
    {'input': 'What is 2+2?', 'expected': '4', 'category': 'math'},
    {'input': 'Explain photosynthesis', 'expected': 'Plants use...', 'category': 'science'},
    {'input': 'Who won the 2020 election?', 'expected': 'Joe Biden', 'category': 'trivia'},
    # ... many more examples
]

test_set = build_stratified_test_set(all_examples, examples_per_category=5)
print(f"Test set: {len(test_set)} cases across {len(set(t.category for t in test_set))} categories")
2. Avoid Contamination
Never use examples from your training/prompt-development data in your test set:
def split_data(all_examples, test_fraction=0.2, seed=42):
    """Split into train (for prompt development) and test (for evaluation)"""
    import random
    rng = random.Random(seed)  # Fixed seed keeps the split reproducible
    shuffled = all_examples[:]
    rng.shuffle(shuffled)
    split_point = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:split_point], shuffled[split_point:]

# Good practice
train_examples, test_examples = split_data(all_examples, test_fraction=0.2)

# Develop prompts using train_examples
prompt_a = "Based on the pattern in these examples..."
prompt_b = "Let me analyze this input..."

# Evaluate on test_examples only
harness = PromptTestHarness(
    [TestCase(f"test_{i}", ex['input'], ex['expected'])
     for i, ex in enumerate(test_examples)]
)
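Shuffle-splitting prevents direct reuse, but it's worth verifying that no test input also appears in the training split, e.g. from duplicates in the raw data. A minimal verbatim-overlap check (the example inputs are made up):

```python
def find_contamination(train, test):
    """Return test inputs that also appear (case-insensitively) in the train split."""
    train_inputs = {ex['input'].strip().lower() for ex in train}
    return [ex['input'] for ex in test
            if ex['input'].strip().lower() in train_inputs]

train = [{'input': 'What is 2+2?'}, {'input': 'Explain gravity'}]
test = [{'input': 'what is 2+2?'}, {'input': 'Define entropy'}]

leaks = find_contamination(train, test)
print(leaks)  # ['what is 2+2?']
```

Exact matching is only a first pass: near-duplicate detection (edit distance, embedding similarity) catches subtler leakage that a verbatim check misses.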
3. Edge Cases and Adversarial Examples
Your test set should include hard cases, not just easy ones:
def add_edge_cases(test_cases: List[TestCase]) -> List[TestCase]:
    """Include hard cases that might break prompts"""
    edge_cases = [
        # Empty/minimal inputs
        TestCase("edge_empty", "", "N/A", "edge"),
        # Extremely long inputs
        TestCase("edge_long", "A" * 5000, "Should handle long input", "edge"),
        # Ambiguous/conflicting inputs
        TestCase("edge_ambiguous",
                 "Is 50 degrees hot or cold?",
                 "It depends on context",
                 "edge"),
        # Inputs similar to training data but slightly different
        TestCase("edge_distribution_shift",
                 "The cat sat on the mat... the dog played with the ball",
                 "Should handle distribution shift",
                 "edge"),
        # Adversarial/tricky inputs
        TestCase("edge_adversarial",
                 "Count the number of 'e's in 'enterprise'",
                 "3",
                 "edge"),
    ]
    return test_cases + edge_cases
Running A/B Tests with Statistical Rigor
Once you have a test set, you need proper statistical methods to compare prompts. Here’s why casual comparison fails:
If prompt A scores 8/10 and prompt B scores 7/10, is A really better? Maybe B just got unlucky on this particular test set. To make a defensible claim, you need a significance test.
The Paired T-Test Approach
When you have the same test cases for both prompts, use a paired t-test:
from scipy import stats
import numpy as np

def paired_ttest_comparison(results_a: List[float],
                            results_b: List[float],
                            alpha: float = 0.05) -> Dict:
    """Compare two prompts with a paired t-test"""
    # A paired test requires the same test cases in the same order
    assert len(results_a) == len(results_b), "Results must have same length"

    mean_a = np.mean(results_a)
    mean_b = np.mean(results_b)
    difference = mean_b - mean_a

    # Two-sided paired t-test on the per-case score differences
    t_statistic, p_value = stats.ttest_rel(results_b, results_a)
    is_significant = p_value < alpha

    # Interpret: a significant result only favors B if the difference is positive
    if is_significant:
        better = "B" if difference > 0 else "A"
        interpretation = f"Prompt {better} is statistically significantly better (p={p_value:.4f})"
    else:
        interpretation = f"No statistically significant difference (p={p_value:.4f})"

    return {
        'prompt_a_mean': mean_a,
        'prompt_b_mean': mean_b,
        'difference': difference,
        'p_value': p_value,
        'is_significant': is_significant,
        'interpretation': interpretation
    }

# Example
scores_prompt_a = [0.9, 0.8, 0.7, 0.85, 0.92, 0.88, 0.91, 0.79, 0.84, 0.86]
scores_prompt_b = [0.92, 0.85, 0.74, 0.87, 0.94, 0.89, 0.93, 0.82, 0.86, 0.88]
comparison = paired_ttest_comparison(scores_prompt_a, scores_prompt_b)
print(comparison['interpretation'])
Understanding Statistical Significance
The p-value answers the question: “If these two prompts were actually identical in quality, what’s the probability of seeing a difference at least this large by chance?”
- p < 0.05: The difference is statistically significant (under a 5% chance of arising from random variation alone). The evidence favors a real difference.
- p >= 0.05: The difference could plausibly be random variation. You can’t confidently say one is better — which is not the same as showing they’re equal.
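One caveat: the t-test assumes the per-case score differences are roughly normal, which is questionable for 0/1 metrics like exact match. A paired bootstrap makes no such assumption and needs only the standard library; this sketch reuses the illustrative scores from the example above:

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resampled mean differences that are <= 0 — an estimate of
    how often B's observed advantage over A could arise from noise alone."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample the per-case differences with replacement
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0:
            not_better += 1
    return not_better / n_resamples

a = [0.9, 0.8, 0.7, 0.85, 0.92, 0.88, 0.91, 0.79, 0.84, 0.86]
b = [0.92, 0.85, 0.74, 0.87, 0.94, 0.89, 0.93, 0.82, 0.86, 0.88]
print(f"bootstrap p = {paired_bootstrap_pvalue(a, b):.4f}")
```

In this toy data every per-case difference favors B, so the bootstrap p-value comes out at zero; with noisier real scores it behaves like a conventional p-value.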
Minimum Sample Size
How many test cases do you need? It depends on how large a difference you want to detect: a few dozen cases can reveal large gaps, but small improvements of a few percentage points typically require hundreds of cases:
def estimate_required_sample_size(effect_size: float = 0.2,
                                  alpha: float = 0.05,
                                  power: float = 0.8) -> int:
    """Estimate minimum sample size per prompt (normal approximation)"""
    from scipy.stats import norm
    # Simplified two-sample power calculation; use statsmodels'
    # TTestPower for a more accurate answer
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = (2 * (z_alpha + z_beta) ** 2) / (effect_size ** 2)
    return int(np.ceil(n))

sample_size = estimate_required_sample_size(effect_size=0.2)
print(f"Need at least {sample_size} test cases per prompt")
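As a cross-check, the same normal-approximation arithmetic can be done by hand. For a paired design (as in the paired t-test above) a standard formula is n = (z_{α/2} + z_β)² / d², where d is the mean per-case difference in units of its standard deviation; the z-values below are the usual 1.96 (α = 0.05, two-sided) and 0.84 (power = 0.8):

```python
import math

def paired_required_n(effect_size: float,
                      z_alpha: float = 1.96,
                      z_beta: float = 0.84) -> int:
    # n = (z_alpha + z_beta)^2 / d^2 for a paired (one-sample-on-differences) design
    return math.ceil(((z_alpha + z_beta) ** 2) / effect_size ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: need ~{paired_required_n(d)} paired test cases")
```

The takeaway is the same either way: detecting small improvements takes hundreds of cases, while only large gaps show up reliably with a few dozen.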
Implementing a Complete Testing Workflow
Here’s a complete, end-to-end example:
import json
from datetime import datetime
from pathlib import Path

class PromptTestingFramework:
    """End-to-end prompt testing with reporting"""

    def __init__(self, test_set: List[TestCase], output_dir: str = "./test_results"):
        self.test_set = test_set
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results = {}

    def test_prompt(self,
                    prompt_name: str,
                    prompt_template: str,
                    model_fn: Callable,
                    metric_functions: Dict[str, Callable]) -> Dict:
        """Test a single prompt and compute all metrics"""
        results = []
        test_timestamp = datetime.now().isoformat()

        for test_case in self.test_set:
            prompt = prompt_template.format(input=test_case.input)

            # Generate output
            start = time.time()
            output = model_fn(prompt)
            latency = time.time() - start

            # Compute all metrics; record failures instead of aborting the run
            metrics = {}
            for metric_name, metric_fn in metric_functions.items():
                try:
                    metrics[metric_name] = metric_fn(output, test_case.expected_output)
                except Exception as e:
                    metrics[metric_name] = f"error: {e}"

            results.append({
                'test_id': test_case.test_id,
                'category': test_case.category,
                'output': output,
                'expected': test_case.expected_output,
                'latency': latency,
                'metrics': metrics
            })

        # Summary statistics
        summary = {
            'prompt_name': prompt_name,
            'timestamp': test_timestamp,
            'num_tests': len(results),
            'avg_latency': float(np.mean([r['latency'] for r in results])),
            'metrics_summary': {}
        }
        for metric_name in metric_functions:
            scores = [r['metrics'][metric_name] for r in results
                      if isinstance(r['metrics'][metric_name], (int, float))]
            if scores:
                summary['metrics_summary'][metric_name] = {
                    'mean': float(np.mean(scores)),
                    'std': float(np.std(scores)),
                    'min': min(scores),
                    'max': max(scores)
                }

        self.results[prompt_name] = {
            'results': results,
            'summary': summary
        }
        return summary

    def compare_prompts(self, prompt_a: str, prompt_b: str,
                        metric: str = 'accuracy') -> Dict:
        """Compare two previously tested prompts with statistical testing"""
        if prompt_a not in self.results or prompt_b not in self.results:
            raise ValueError("Both prompts must be tested first")

        results_a = self.results[prompt_a]['results']
        results_b = self.results[prompt_b]['results']

        # Get numeric scores for the chosen metric
        scores_a = [r['metrics'][metric] for r in results_a
                    if isinstance(r['metrics'][metric], (int, float))]
        scores_b = [r['metrics'][metric] for r in results_b
                    if isinstance(r['metrics'][metric], (int, float))]

        # Statistical test
        comparison = paired_ttest_comparison(scores_a, scores_b)

        # Per-category analysis: index results by test_id for direct lookup
        results_a_by_id = {r['test_id']: r for r in results_a}
        results_b_by_id = {r['test_id']: r for r in results_b}
        category_analysis = {}
        for test_case in self.test_set:
            cat = test_case.category
            category_analysis.setdefault(cat, {'a': [], 'b': []})
            for side, by_id in (('a', results_a_by_id), ('b', results_b_by_id)):
                r = by_id.get(test_case.test_id)
                if r and isinstance(r['metrics'][metric], (int, float)):
                    category_analysis[cat][side].append(r['metrics'][metric])

        category_summary = {}
        for cat, scores in category_analysis.items():
            if scores['a'] and scores['b']:
                category_summary[cat] = {
                    'a_mean': float(np.mean(scores['a'])),
                    'b_mean': float(np.mean(scores['b'])),
                    'b_better': np.mean(scores['b']) > np.mean(scores['a'])
                }

        return {
            'overall_comparison': comparison,
            'category_analysis': category_summary
        }

    def generate_report(self, output_file: str = "test_report.json"):
        """Generate JSON report of all tests"""
        report = {
            'timestamp': datetime.now().isoformat(),
            'test_set_size': len(self.test_set),
            'prompts_tested': list(self.results.keys()),
            'summaries': {name: data['summary'] for name, data in self.results.items()}
        }
        output_path = self.output_dir / output_file
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2, default=float)
        print(f"Report saved to {output_path}")
        return report
# Usage example
def exact_match(output, expected):
    return int(output.strip() == expected.strip())

def contains_key_info(output, expected):
    words = set(expected.lower().split())
    if not words:
        return 1.0
    return sum(1 for w in words if w in output.lower()) / len(words)

framework = PromptTestingFramework(test_set)

# Test prompt A
framework.test_prompt(
    "Prompt A",
    "Answer briefly: {input}",
    model_fn=my_model.generate,
    metric_functions={
        'accuracy': exact_match,
        'relevance': contains_key_info
    }
)

# Test prompt B
framework.test_prompt(
    "Prompt B",
    "Provide a concise answer: {input}",
    model_fn=my_model.generate,
    metric_functions={
        'accuracy': exact_match,
        'relevance': contains_key_info
    }
)

# Compare
comparison = framework.compare_prompts("Prompt A", "Prompt B")
print(f"Result: {comparison['overall_comparison']['interpretation']}")

# Generate report
framework.generate_report()
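Exact match is harsh on free-form answers ("Joe Biden" vs. "Joe Biden won"). A token-level F1, in the style of reading-comprehension benchmarks, credits partial overlap and can be dropped in as another metric_fn:

```python
from collections import Counter

def token_f1(output: str, expected: str) -> float:
    """Harmonic mean of token precision and recall against the expected answer."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    if not out_tokens or not exp_tokens:
        return float(out_tokens == exp_tokens)  # both empty counts as a match
    # Multiset intersection: counts shared tokens, respecting repetitions
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Joe Biden won the election", "Joe Biden"))  # ~0.571
```

Perfect recall with imperfect precision (extra words) yields a score between 0 and 1 instead of a flat failure, which makes per-prompt averages far less brittle for generative tasks.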
Using Existing Tools
You don’t have to build everything from scratch. These tools provide testing infrastructure:
PromptFoo
A CLI tool for declarative prompt testing. Prompts and test cases live in YAML files (the schema below is illustrative; check the PromptFoo docs for the current config format):
# prompts.yaml
- id: prompt-a
  template: "Answer briefly: {input}"
- id: prompt-b
  template: "Provide a concise answer: {input}"

# tests.yaml
- input: "What is 2+2?"
  expected: "4"
- input: "Explain gravity"
  expected: "Force that attracts objects with mass"

# Run the evaluation
promptfoo eval
LangSmith
LangChain’s evaluation and monitoring platform (API details vary by version; consult the LangSmith docs):
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Log a prompt run
client.create_run(
    name="prompt-test",
    run_type="llm",
    inputs={"question": "What is 2+2?"},
    outputs={"answer": "4"}
)

# Evaluate and track
results = evaluate(
    lambda x: model.invoke(prompt.format(input=x['input'])),
    data=dataset,
    evaluators=[accuracy_evaluator, relevance_evaluator],
    experiment_prefix="my-experiment"
)
Exercise: Build a Testing Pipeline
Build a complete A/B testing pipeline for a text classification task (classify reviews as positive/negative):
- Create 50 test cases (25 positive, 25 negative reviews)
- Write two different prompts for classification
- Run both prompts against the test set
- Implement accuracy and F1-score metrics
- Run a statistical test to determine if one is significantly better
- Generate a report showing:
- Overall accuracy for each prompt
- Per-category (positive vs negative) breakdown
- P-value and statistical significance
- Per-example failures (which reviews each prompt got wrong)
Starter code:
test_cases = [
    TestCase("test_001", "This product is amazing!", "positive"),
    TestCase("test_002", "Terrible quality, waste of money", "negative"),
    # ... 48 more
]

prompt_a = "Classify as positive or negative: {input}"
prompt_b = "Is this review positive or negative? {input}"

harness = PromptTestHarness(test_cases)
# ... continue with testing
Submit your code, the comparison results, and your interpretation of whether one prompt is statistically significantly better than the other.
Summary
In this lesson, you’ve learned:
- Why casual testing fails and what rigorous testing looks like
- How to build test harnesses that automate prompt evaluation
- How to create representative test sets with stratified sampling
- Statistical methods (paired t-tests) to determine significance
- How to analyze results by category to understand prompt strengths/weaknesses
- Tools like PromptFoo and LangSmith that automate testing
- A complete end-to-end testing framework you can adapt to your own tasks
Next, you’ll learn how different models respond differently to prompts, and how to optimize for specific models rather than just one generic approach.