Intermediate

Defining Prompt Quality Metrics

Lesson 1 of 4 · Estimated Time: 50 min


Introduction

In the Foundations phase, you learned how to write prompts that work. Now it’s time to measure whether they work well. Moving from intuition to measurement is a critical shift as you develop production systems. When you say “this prompt is better,” you need data to back it up.

This lesson teaches you how to think about prompt quality systematically. Instead of vaguely feeling that one prompt is “clearer” or “more accurate,” you’ll define metrics that let you compare prompts objectively.

Key Takeaway: Good prompts can’t be evaluated by gut feel alone. You need quantifiable metrics to understand what works, predict performance on new tasks, and make confident decisions about which prompt to deploy.

What Makes a Prompt “Good”?

Before we can measure anything, let’s define what “good” means for your use case. Different applications value different things:

  • A classification system cares most about accuracy and consistency
  • A content generator cares about creativity, diversity, and relevance
  • A summarizer cares about brevity, completeness, and key point extraction
  • A chatbot cares about response latency, safety, and user satisfaction

The prompt that excels at one task might fail at another. This is why we start with clarity about what success looks like for your specific problem.

Core Quality Dimensions

Let’s explore five fundamental dimensions of prompt quality:

1. Accuracy

Definition: Does the model produce the correct output?

For deterministic tasks (classification, extraction, arithmetic), accuracy is straightforward to compute: each output is either right or wrong.

def evaluate_accuracy(predictions, ground_truth):
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(predictions)

For example, if your prompt is designed to classify customer feedback as “positive,” “negative,” or “neutral,” you’d count how many classifications match human labels.
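Here's that sentiment example run through evaluate_accuracy (repeated here so the snippet stands alone; the labels are made up for illustration):

```python
# evaluate_accuracy as defined above, with made-up sentiment labels
def evaluate_accuracy(predictions, ground_truth):
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(predictions)

predictions  = ["positive", "negative", "neutral", "positive", "negative"]
human_labels = ["positive", "negative", "positive", "positive", "neutral"]

accuracy = evaluate_accuracy(predictions, human_labels)
print(accuracy)  # 3 of 5 classifications match the human labels -> 0.6
```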

However, accuracy becomes fuzzy for generative tasks. A customer support response might be helpful even if it doesn’t match the expected answer exactly.

2. Relevance

Definition: Do the outputs address the input question or task?

This matters most for open-ended tasks like summarization, question-answering, or content generation. A response might be grammatically perfect but completely miss the point.

You can measure relevance through:

  • Keyword matching: Does the response mention key concepts from the query?
  • Semantic similarity: Using embeddings to measure how similar the response is to the expected content
  • Human judgment: Raters score relevance on a 1-5 scale

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_relevance(response, expected_response, embedding_fn):
    """Compare response to expected output using embeddings"""
    emb_response = embedding_fn(response)
    emb_expected = embedding_fn(expected_response)
    similarity = cosine_similarity([emb_response], [emb_expected])[0][0]
    return similarity  # 0 to 1, higher is better

3. Consistency

Definition: Does the prompt produce similar outputs for similar inputs?

Consistency is crucial for production systems. If the same customer question gets answered differently each time, users lose trust. You measure consistency by:

  • Running the same prompt multiple times at a temperature above zero (where sampling introduces randomness)
  • Checking agreement between multiple model calls
  • Measuring variance across outputs

def consistency_score(prompt, input_text, model_fn, embedding_fn, num_runs=5):
    """Measure how consistent a prompt is across multiple calls"""
    outputs = [model_fn(prompt, input_text) for _ in range(num_runs)]

    # Measure pairwise similarity using the semantic_relevance helper above
    similarities = []
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            sim = semantic_relevance(outputs[i], outputs[j], embedding_fn)
            similarities.append(sim)

    return np.mean(similarities)  # Average agreement

4. Latency

Definition: How long does the prompt take to produce an output?

Token generation speed matters in interactive systems. A 2-second response feels responsive; a 10-second response feels slow. Latency depends on:

  • The model (larger models are slower)
  • The prompt length (longer prompts mean more tokens to process)
  • The expected output length (generating 1000 tokens takes longer than generating 50)

Track latency in your evaluation:

import time

def measure_latency(prompt, input_text, model_fn, num_runs=3):
    """Measure average response time"""
    times = []
    for _ in range(num_runs):
        start = time.time()
        response = model_fn(prompt, input_text)
        end = time.time()
        times.append(end - start)

    return np.mean(times), np.std(times)

5. Cost

Definition: What does this prompt cost per call?

Model pricing is token-based. Longer prompts and longer outputs cost more. As you optimize, you’ll often face tradeoffs: a more detailed system prompt might improve accuracy but increase costs.

def estimate_cost(prompt, avg_output_tokens, model="gpt-4"):
    """Estimate cost per call (rates are illustrative; check current pricing)"""
    costs = {
        "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
        "gpt-3.5": {"input": 0.0005, "output": 0.0015},
        "claude-3-opus": {"input": 0.015, "output": 0.075},
    }

    input_tokens = len(prompt.split())  # rough estimate; use a tokenizer for precision
    rate = costs[model]

    return (input_tokens * rate["input"] +
            avg_output_tokens * rate["output"]) / 1000
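To make the tradeoff concrete, here's a sketch comparing a short prompt against a more detailed one at the same per-token rates (the rates, token counts, and call volume are all illustrative):

```python
# Illustrative rates, not real pricing
rates = {"input": 0.03, "output": 0.06}  # $ per 1K tokens

def cost_per_call(input_tokens, output_tokens, rates):
    return (input_tokens * rates["input"] +
            output_tokens * rates["output"]) / 1000

short_prompt_cost = cost_per_call(input_tokens=50, output_tokens=200, rates=rates)
detailed_prompt_cost = cost_per_call(input_tokens=400, output_tokens=200, rates=rates)

print(f"short: ${short_prompt_cost:.4f}/call, detailed: ${detailed_prompt_cost:.4f}/call")
# The detailed prompt costs $0.0105 more per call -- about $105/day at 10,000 calls
```

If the detailed prompt raises accuracy enough to avoid expensive downstream failures, the extra spend may be worth it; the point is to make that decision with numbers.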

Task-Specific Evaluation Criteria

Now let’s see how these dimensions combine for different task types:

Classification Tasks

For tasks like sentiment analysis, spam detection, or intent recognition:

  • Primary metric: Accuracy (or F1-score if classes are imbalanced)
  • Secondary metrics: Consistency (always classify the same input the same way)
  • Watch out for: Edge cases and ambiguous examples

from sklearn.metrics import precision_recall_fscore_support

def evaluate_classification(predictions, ground_truth, labels):
    """Full classification evaluation"""
    precision, recall, f1, support = precision_recall_fscore_support(
        ground_truth, predictions, labels=labels, average='weighted'
    )
    return {
        'accuracy': sum(p == g for p, g in zip(predictions, ground_truth)) / len(predictions),
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

Extraction Tasks

For extracting structured data from text (names, dates, amounts):

  • Primary metric: Accuracy (does the extracted value match the correct value?)
  • Secondary metrics: Precision/recall for partial matches, consistency
  • Watch out for: Format mismatches (extracting “3/15/2024” when you want “2024-03-15”)

def evaluate_extraction(predicted_values, ground_truth_values):
    """Evaluate extraction accuracy"""
    exact_matches = sum(p == g for p, g in zip(predicted_values, ground_truth_values))
    partial_matches = sum(
        str(p).lower() in str(g).lower() or str(g).lower() in str(p).lower()
        for p, g in zip(predicted_values, ground_truth_values)
    )
    return {
        'exact_match_rate': exact_matches / len(predicted_values),
        'partial_match_rate': partial_matches / len(predicted_values)
    }

Generation Tasks

For summarization, paraphrasing, or content generation:

  • Primary metric: Relevance (human judges or semantic similarity)
  • Secondary metrics: Length appropriateness, readability
  • Watch out for: Hallucinations (making up facts)

def evaluate_generation(generated_texts, reference_texts, embedding_fn):
    """Evaluate generated content"""
    metrics = {
        'avg_relevance': np.mean([
            semantic_relevance(gen, ref, embedding_fn)
            for gen, ref in zip(generated_texts, reference_texts)
        ]),
        'avg_length': np.mean([len(text.split()) for text in generated_texts]),
        # Standard deviation of length: lower means more consistent lengths
        'length_variability': np.std([len(text.split()) for text in generated_texts])
    }
    return metrics

Building Evaluation Rubrics

An evaluation rubric is a structured scorecard for assessing outputs. Here’s a template:

Rubric: Customer Support Response
Criteria:
  - Name: Answers the Question
    Scale: 1-5
    1: "Doesn't address the customer's issue at all"
    3: "Addresses part of the issue"
    5: "Completely answers the customer's question"

  - Name: Tone and Professionalism
    Scale: 1-5
    1: "Rude, dismissive, or inappropriate"
    3: "Professional but slightly impersonal"
    5: "Professional, empathetic, and helpful"

  - Name: Accuracy
    Scale: 1-5
    1: "Contains false information"
    3: "Mostly accurate with minor errors"
    5: "Completely accurate"

  - Name: Actionability
    Scale: 1-5
    1: "Gives no clear next steps"
    3: "Suggests some actions"
    5: "Provides clear, specific steps"

Scoring:
  Total Score: (sum of all criteria) / 4
  Pass Threshold: Average >= 4.0

Create a rubric by:

  1. Identifying what success looks like for your task
  2. Breaking it into measurable criteria
  3. Defining clear levels or scores
  4. Testing the rubric with examples (do different raters agree?)
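The rubric above can be sketched as a small scoring helper. The criterion names and the 4.0 threshold come from the template; the sample scores are made up:

```python
# Criterion names and threshold taken from the customer support rubric above
def score_rubric(scores, pass_threshold=4.0):
    """Average the 1-5 criterion scores and check against the pass threshold."""
    if not all(1 <= s <= 5 for s in scores.values()):
        raise ValueError("Each criterion must be scored 1-5")
    average = sum(scores.values()) / len(scores)
    return {"average": average, "passed": average >= pass_threshold}

result = score_rubric({
    "Answers the Question": 5,
    "Tone and Professionalism": 4,
    "Accuracy": 5,
    "Actionability": 4,
})
print(result)  # {'average': 4.5, 'passed': True}
```

Encoding the rubric this way makes step 4 easier: you can run two raters' scores through the same function and compare their averages directly.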

Automated vs. Human Evaluation

Each approach has tradeoffs:

Automated Evaluation (Fast, Cheap, Limited)

Advantages:

  • Runs in seconds
  • Scales to thousands of examples
  • No human bias or fatigue
  • Repeatable and consistent

Disadvantages:

  • Hard to capture nuance
  • Can be gamed (outputs can score well on a metric without actually being good)
  • Requires reference answers for many metrics

When to use: Accuracy checks, length constraints, format validation, semantic similarity
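A minimal sketch of such automated checks, assuming the model is asked to return JSON with label and confidence keys (the key names and word limit here are illustrative assumptions, not from the lesson):

```python
import json

# Required keys and max_words are illustrative defaults
def validate_output(output, required_keys=("label", "confidence"), max_words=100):
    """Run cheap automated checks: JSON validity, required keys, length."""
    checks = {}
    try:
        parsed = json.loads(output)
        checks["valid_json"] = True
        checks["has_required_keys"] = all(k in parsed for k in required_keys)
    except json.JSONDecodeError:
        checks["valid_json"] = False
        checks["has_required_keys"] = False
    checks["within_length"] = len(output.split()) <= max_words
    return checks

print(validate_output('{"label": "positive", "confidence": 0.93}'))
# {'valid_json': True, 'has_required_keys': True, 'within_length': True}
```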

Human Evaluation (Slow, Expensive, Nuanced)

Advantages:

  • Captures subjective qualities (tone, helpfulness, creativity)
  • Helps you understand why something works or fails
  • Provides ground truth for training automated metrics

Disadvantages:

  • Slow (hours to weeks)
  • Expensive (annotators, recruitment)
  • Inconsistent between raters
  • Not scalable

When to use: First time evaluating a new task, edge cases, understanding failure modes, training data labeling

The best approach combines both:

def evaluate_with_both(prompt, test_set, model_fn):
    """Generate initial metrics with automation, validate with humans"""

    # Step 1: Run automated evaluation
    results = []
    for input_text, reference_output in test_set:
        output = model_fn(prompt, input_text)
        auto_score = automated_metric(output, reference_output)
        results.append({
            'input': input_text,
            'output': output,
            'reference': reference_output,
            'auto_score': auto_score
        })

    # Step 2: Find uncertain cases
    uncertain = [r for r in results if r['auto_score'] < 0.7]

    # Step 3: Send uncertain cases to human annotators
    # (request_human_review is a placeholder for your annotation pipeline)
    human_reviews = request_human_review(uncertain)

    # Step 4: Combine scores (human scores override for reviewed cases;
    # uncertain holds references to the same dicts as results)
    for result, human_score in zip(uncertain, human_reviews):
        result['final_score'] = human_score
    for result in results:
        result.setdefault('final_score', result['auto_score'])

    return results

Putting It Together: A Complete Evaluation Framework

Here’s a practical example evaluating a customer support prompt:

import json
from dataclasses import dataclass

@dataclass
class PromptEvaluation:
    """Complete evaluation of a prompt"""
    prompt: str
    test_cases: list

    def evaluate(self, model_fn):
        results = []

        for test_case in self.test_cases:
            customer_query = test_case['input']
            expected_response = test_case['expected']

            # Generate response
            actual_response = model_fn(self.prompt, customer_query)

            # Evaluate multiple dimensions
            evaluation = {
                'input': customer_query,
                'expected': expected_response,
                'actual': actual_response,
                'metrics': {
                    # Assumes the semantic_relevance helper from earlier (with an
                    # embedding function bound in) and a tone_score classifier
                    # you supply for your own domain
                    'relevance': semantic_relevance(actual_response, expected_response),
                    'length': len(actual_response.split()),
                    'contains_action': any(word in actual_response.lower()
                                          for word in ['click', 'contact', 'submit', 'visit']),
                    'professional_tone': tone_score(actual_response),
                }
            }
            results.append(evaluation)

        # Summary statistics
        summary = {
            'avg_relevance': np.mean([r['metrics']['relevance'] for r in results]),
            'avg_length': np.mean([r['metrics']['length'] for r in results]),
            'action_rate': sum([r['metrics']['contains_action'] for r in results]) / len(results),
            'avg_tone': np.mean([r['metrics']['professional_tone'] for r in results]),
            'total_tests': len(results)
        }

        return results, summary

# Usage
evaluation = PromptEvaluation(
    prompt="You are a helpful customer support assistant...",
    test_cases=[
        {
            'input': "How do I reset my password?",
            'expected': "Instructions for password reset"
        },
        {
            'input': "Your service is terrible!",
            'expected': "Empathetic acknowledgment and resolution path"
        }
    ]
)

results, summary = evaluation.evaluate(model_fn=my_model.generate)
print(json.dumps(summary, indent=2))

Exercise: Create an Evaluation Rubric

Create a complete evaluation rubric for a product recommendation assistant that takes user preferences and suggests products. Your rubric should:

  1. Define 4-5 evaluation criteria
  2. For each criterion, define a 1-5 scale with clear descriptions
  3. Identify which metrics are automated vs. human-evaluated
  4. Write example test cases
  5. Determine a passing threshold

Starter template:

Rubric: Product Recommendation Assistant

Criteria:
  - Name: [Your criterion]
    Scale: 1-5
    Automated: yes/no
    1: "[Poor description]"
    3: "[Average description]"
    5: "[Excellent description]"

Test Cases:
  - input: "I want shoes that are comfortable for long hikes"
    expected_themes: ["durability", "comfort", "outdoor-appropriate"]

  - input: "I need something for business meetings"
    expected_themes: ["professional", "formal", "polished"]

Pass Threshold: [Your criteria for "good enough"]

Share your rubric and explain your choices. Why did you weight certain criteria more heavily?

Summary

In this lesson, you’ve learned:

  • Why measurement matters more than intuition for prompt evaluation
  • Five core dimensions of prompt quality: accuracy, relevance, consistency, latency, and cost
  • How to tailor evaluation to specific task types
  • How to build evaluation rubrics that capture what matters
  • The tradeoffs between automated and human evaluation
  • How to combine both approaches for robust assessment

In the next lesson, you’ll put these metrics to work with systematic testing frameworks that let you compare prompts scientifically.