Debugging and Iterating on Prompts
Even with all the techniques you’ve learned, prompts sometimes fail. The model might misunderstand, go off-topic, be too verbose, or miss nuance. When this happens, you need a systematic way to diagnose and fix the problem. This is where debugging and iteration come in. Think of it like testing software: you observe failures, form hypotheses about the cause, adjust, and test again.
When Prompts Fail: Common Failure Modes
Understanding why a prompt fails is the first step to fixing it. Here are the most common failure modes:
Failure Mode 1: The Model Misunderstands the Task
The model produces something syntactically correct but fundamentally wrong.
Symptom:
You ask: "Summarize this article about climate change"
Model produces: A personal essay about why climate is important
Expected: A factual summary of the article's main points
Root causes:
- “Summarize” is ambiguous (summary of what? The main points? Key arguments?)
- Insufficient context about what you want extracted
- The model defaults to a generic interpretation
How to fix:
- Be more specific about what “summary” means for YOUR task
- Provide an example of a good summary
- Add constraints (format, length, focus areas)
Better prompt:
Extract the 3 main arguments from this article about climate change.
Format as a numbered list, 1-2 sentences per argument.
Focus on arguments the author explicitly makes, not your own interpretation.
Article:
[...]
Failure Mode 2: Over-Explanation (Verbosity)
The model adds unnecessary context, disclaimers, or background.
Symptom:
You ask: "Is 2+2 equal to 4?"
Model responds: "In standard mathematics, when we consider the arithmetic
operation of addition, two units plus two units equal four units. This is
based on the foundational axioms of arithmetic, which have been proven..."
Expected: "Yes" or "Yes, 2+2=4"
Root causes:
- No constraint on length or format
- The model defaults to being thorough
- No example showing brevity
How to fix:
- Add explicit length constraints
- Use negative instructions (“Don’t explain basic concepts”)
- Provide a terse example
- Specify output format strictly
Better prompt:
Is 2+2 equal to 4?
Answer with only "Yes" or "No". No explanation needed.
Failure Mode 3: Hallucination (Making Things Up)
The model invents facts, statistics, or sources that don’t exist.
Symptom:
You ask: "What did the CEO say about this in their earnings call?"
Model responds: "The CEO mentioned that Q3 revenue grew 40% and they
plan to expand into 5 new markets" (but the CEO said no such thing)
Root causes:
- The model doesn’t have the specific information
- No constraint against fabrication
- The task seems to demand a confident answer
How to fix:
- Add explicit anti-hallucination instructions
- Require citations or quotes
- Use phrases like “Based on what you see in the document…”
- Ask for “unknown” if the answer isn’t in the provided text
Better prompt:
Based ONLY on the earnings call transcript provided below, what did the CEO say
about revenue growth?
If the CEO did not discuss revenue growth, respond: "Not mentioned in this call"
Do NOT make up numbers or claims not explicitly stated in the transcript.
Transcript:
[...]
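Anti-hallucination instructions can also be backed by a lightweight automated check: scan the model's answer for numbers that never appear in the source document. A minimal sketch (the function name and regex are illustrative, not from any library; a substring check like this catches fabricated figures but not fabricated claims in prose):

```python
import re

def find_ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Return numeric claims in the answer that never appear in the source."""
    numbers = re.findall(r"\d+(?:\.\d+)?%?", answer)
    return [n for n in numbers if n not in source]

transcript = "Revenue grew 12% year over year, driven by enterprise deals."
answer = "The CEO said revenue grew 40% and plans to expand into 5 new markets."

print(find_ungrounded_numbers(answer, transcript))  # ['40%', '5']
```

Flagged numbers are a signal to re-check the output against the transcript, not proof of hallucination on their own.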
Failure Mode 4: Off-Topic or Tangential
The model drifts from your core request into related but wrong territory.
Symptom:
You ask: "How do I improve my typing speed?"
Model responds with tips about ergonomic chairs, desk setup, and
maintaining good posture (technically helpful but not what you asked for)
Root causes:
- The prompt is too open-ended
- The model follows associations rather than strict instructions
- No clear boundaries around scope
How to fix:
- Narrow the scope with constraints
- Use negative instructions to exclude related topics
- Provide examples of on-topic responses
- Use XML/markdown delimiters to emphasize what you want
Better prompt:
How do I improve my typing speed? Focus specifically on:
1. Typing technique and finger placement
2. Drills and practice exercises
3. Tools or software that help practice
DO NOT include:
- Ergonomic setup (monitor height, chair, etc.)
- Health/posture advice
- General wellness tips
Format: Numbered list of 5 specific techniques
Failure Mode 5: Wrong Format or Structure
The output is correct but in the wrong format.
Symptom:
You ask: "List 3 benefits of exercise"
Model responds: "Exercise offers many benefits, including improved
cardiovascular health and stronger muscles, plus better mental health..."
Expected:
- Cardiovascular health
- Stronger muscles
- Better mental health
Root causes:
- Format constraint was vague or missing
- The model’s default format doesn’t match your need
- No example showing the desired format
How to fix:
- Specify format extremely explicitly
- Provide an example of correctly formatted output
- Use strict delimiters or templates
- Test the prompt with a simple, unambiguous example first
Better prompt:
List 3 benefits of exercise. Format as a bullet-point list.
Benefits:
- [benefit 1]
- [benefit 2]
- [benefit 3]
Example of correct format:
- Improves cardiovascular health
- Increases muscle strength
- Enhances mental well-being
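Format failures are the easiest failure mode to catch programmatically. A small validator like the sketch below (names are illustrative) can flag outputs that drift from the required bullet-list shape before they reach a user:

```python
def is_bullet_list(text: str, expected_items: int = 3) -> bool:
    """True if every non-empty line is a '- ' bullet and the count matches."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    return len(lines) == expected_items and all(ln.startswith("- ") for ln in lines)

good = "- Improves cardiovascular health\n- Increases muscle strength\n- Enhances mental well-being"
bad = "Exercise offers many benefits, including improved cardiovascular health."

print(is_bullet_list(good))  # True
print(is_bullet_list(bad))   # False
```

Checks like this pair naturally with the retest step: rerun the prompt, validate automatically, and only inspect by hand when the check fails.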
The Iterative Refinement Loop: Test → Analyze → Adjust → Retest
Prompt engineering is iterative. You rarely get it perfect on the first try. Here’s the systematic process:
Step 1: Test (Run the Prompt)
Run your prompt against the model and save the output.
Prompt v1:
"Write a blog post about remote work"
Output:
[Generic 500-word piece that could apply to any company]
Step 2: Analyze (Diagnose the Problem)
Ask yourself: What’s wrong? Is it a problem with:
- Clarity? (Does the model understand the task?)
- Format? (Is the output shaped correctly?)
- Content? (Is it the right information?)
- Tone? (Does it sound right?)
- Completeness? (Is anything missing?)
Analysis:
- Problem: Too generic, no target audience
- Root cause: Prompt didn't specify who this is for
- Failure mode: The model defaulted to generic content
- Fix strategy: Add specific audience and context
Step 3: Adjust (Modify the Prompt)
Make a targeted change based on your analysis. Change only one or two things at a time so you can isolate what works.
Prompt v2 (Adjusted):
"Write a 800-word blog post about remote work for a SaaS startup founder
who's deciding whether to go fully remote. Focus on pros and cons specific
to scaling a technical team. Keep tone conversational and practical."
Output:
[Much more targeted, but still missing specific founder pain points]
Step 4: Retest (Evaluate the Change)
Run the new prompt and compare to the previous output. Is it better? Worse? Same?
Comparison:
v1 output: Generic, no audience
v2 output: Better targeted, more specific, but missing founder perspective
Next issue: Should address founder-specific concerns (hiring, retention, etc.)
Step 5: Iterate Again
Return to step 1 with your new understanding.
Prompt v3 (Further adjusted):
"Write an 800-word blog post about remote work for a SaaS founder who's
building a technical team. Address:
1. How remote work affects hiring and retention
2. Communication and culture challenges
3. Productivity and collaboration in engineering teams
Tone: Conversational, practical, from experience of someone who's done this.
Include 1-2 specific examples of challenges you've faced or heard about.
DO NOT: Add generic pros/cons everyone already knows"
Output:
[Much more useful and specific]
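The test → analyze → adjust → retest loop can be sketched as a small harness. Here `run_model` is a stub standing in for a real API call, and `check` is whatever pass/fail test matters for your task; all names are illustrative:

```python
def run_model(prompt: str) -> str:
    # Stub: swap in a real API call when testing live.
    return f"DRAFT: {prompt}"

def iterate(prompts, check):
    """Run each prompt version in order, stopping at the first that passes."""
    history = []
    for version, prompt in enumerate(prompts, start=1):
        output = run_model(prompt)
        passed = check(output)
        history.append({"version": version, "prompt": prompt, "passed": passed})
        if passed:
            break
    return {"history": history, "final": history[-1]}

versions = [
    "Write a blog post about remote work",
    "Write a blog post about remote work for a SaaS founder",
]
result = iterate(versions, check=lambda out: "founder" in out)
print(result["final"]["version"])  # 2
```

The `history` list doubles as an iteration log, which leads directly into the logging strategies below.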
Prompt Logging and Version Control Strategies
As you iterate, keep track of your work. This prevents regressions and lets you understand what worked.
Simple Logging in a Text File
## Remote Work Blog Post
### Attempt 1 - Initial Prompt
Prompt: "Write a blog post about remote work"
Result: Generic, no clear audience
Issues: Too open-ended, defaults to median content
### Attempt 2 - Added Audience
Prompt: [full prompt v2]
Result: Better focused, but missing founder perspective
Issues: Still not specific enough to founder pain points
### Attempt 3 - Added Specific Elements
Prompt: [full prompt v3]
Result: Excellent. Specific founder perspective, practical advice
Status: ✅ WORKING
### Final Prompt (v3)
[Store the final working prompt here for reuse]
Structured Logging in JSON
For more systematic tracking:
{
  "task": "Generate blog post about remote work",
  "iterations": [
    {
      "version": 1,
      "prompt": "Write a blog post about remote work",
      "score": 2,
      "issues": ["too generic", "no audience"],
      "next_adjustment": "add specific audience"
    },
    {
      "version": 2,
      "prompt": "Write a blog post about remote work for SaaS founder...",
      "score": 6,
      "issues": ["missing founder pain points"],
      "next_adjustment": "add specific founder challenges"
    },
    {
      "version": 3,
      "prompt": "Write an 800-word blog post about remote work for a SaaS founder addressing hiring, communication, and engineering team productivity...",
      "score": 9,
      "issues": [],
      "status": "WORKING"
    }
  ]
}
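A small helper can maintain a log in this shape automatically, numbering versions as you go. A sketch (the file name and field names are assumptions, mirroring the example log above):

```python
import json
from pathlib import Path

def log_iteration(path: Path, task: str, entry: dict) -> None:
    """Append one iteration record to a JSON log, creating the file if needed."""
    if path.exists():
        log = json.loads(path.read_text())
    else:
        log = {"task": task, "iterations": []}
    entry = {"version": len(log["iterations"]) + 1, **entry}
    log["iterations"].append(entry)
    path.write_text(json.dumps(log, indent=2))

log_file = Path("prompt_log.json")
log_iteration(log_file, "Blog post about remote work",
              {"prompt": "Write a blog post about remote work",
               "score": 2, "issues": ["too generic"]})
log_iteration(log_file, "Blog post about remote work",
              {"prompt": "Write a blog post about remote work for a SaaS founder...",
               "score": 6, "issues": ["missing founder pain points"]})
```

Because the log is plain JSON on disk, it also versions cleanly in git alongside the prompts themselves.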
Version Control Strategy
If you use git (version control), store prompts as files:
prompts/
├── blog-posts/
│   └── remote-work-blog.md
│       v1: too generic
│       v2: added audience
│       v3: added founder focus (FINAL)
├── code-reviews/
│   └── python-code-review.md
└── data-analysis/
    └── financial-analysis.md
Commit changes with clear messages:
git add prompts/blog-posts/remote-work-blog.md
git commit -m "feat: improve remote work blog prompt with founder perspective"
Temperature, Top-P, and Other Parameters That Affect Output
Beyond the prompt itself, model parameters affect output. Understanding these helps you debug.
Temperature (Randomness)
Default: 1.0 (balanced randomness and consistency). Note that Anthropic's API accepts temperatures from 0.0 to 1.0; some other providers allow values up to 2.0.
import anthropic

client = anthropic.Anthropic()

# Low temperature: deterministic
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    temperature=0.0,  # Almost always picks the highest-probability tokens
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
# Output: near-identical on every run

# High temperature: creative
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    temperature=1.0,  # The maximum the Anthropic API accepts
    messages=[{"role": "user", "content": "Tell me a story about a cat"}]
)
# Output: varies noticeably between runs
When to adjust:
- Lower (0.0-0.3) for: factual tasks, code generation, when consistency is needed
- Higher (0.7-1.0) for: creative writing, brainstorming, ideation
Top-P (Nucleus Sampling)
Default: 1.0 - Consider all possible tokens
Top-P controls what percentage of the probability distribution is considered:
# top_p = 0.9: consider only the tokens that make up the top 90% of probability
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=300,
    top_p=0.9,
    messages=[{"role": "user", "content": "Write a poem about spring"}]
)
# More focused, less random
Interaction with temperature:
- Temperature controls how strictly sampling follows the distribution
- Top-P controls which tokens remain in the distribution
- Using both: the distribution is first truncated with top-p, then temperature is applied to the surviving tokens (in practice, adjust one at a time so you can tell which change mattered)
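The truncation step is easy to see with a toy next-token distribution. This sketch (made-up probabilities, illustrative only) keeps the smallest set of top tokens whose cumulative probability reaches top_p:

```python
# Made-up next-token probabilities (sum to 1.0)
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.04, "xylophone": 0.01}

def nucleus(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of highest-probability tokens covering top_p mass."""
    kept, mass = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    return kept

print(nucleus(probs, 0.9))  # {'the': 0.5, 'a': 0.3, 'cat': 0.15}
```

With top_p = 0.9, the low-probability tails ("zebra", "xylophone") are excluded before any sampling happens, which is why lowering top_p reduces odd, low-probability outputs.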
Max Tokens
Limits the response length:
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,  # Response hard-capped at 100 tokens
    messages=[{"role": "user", "content": "Write a novel"}]
)
# Note: this truncates output (possibly mid-sentence); it does not make the
# model write more concisely. Pair it with a length constraint in the prompt.
Troubleshooting with Parameters
| Problem | Parameter | Adjustment |
|---|---|---|
| Output is random/inconsistent | temperature | Lower to 0.0-0.3 |
| Output is too generic/repetitive | temperature | Raise to 0.7-1.0 |
| Getting low-probability wrong answers | top_p | Lower to 0.5-0.8 |
| Response is too long | max_tokens | Set explicit limit |
| Model rambles or goes off-topic | max_tokens + temperature | Lower both |
A/B Testing Prompts Informally
Without extensive infrastructure, you can still compare prompts:
Simple A/B Test
Task: Generate email subject lines for a newsletter
VARIANT A:
"Generate 5 email subject lines for a tech newsletter"
VARIANT B:
"Generate 5 email subject lines for a tech newsletter.
Goal: High open rate (30%+)
Audience: Software engineers who value learning
Style: Specific insight or tip, not clickbait
Format: Numbered list"
Run both prompts 3 times each, note which variant produces better results.
Scoring Framework
Score outputs on criteria that matter:
Criteria: (1=bad, 5=excellent)
- Relevance to task: ___
- Clarity: ___
- Completeness: ___
- Usability: ___
VARIANT A average score: ___
VARIANT B average score: ___
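Averaging those per-criterion scores across runs takes only a few lines of code. A sketch with made-up ratings (criterion names mirror the framework above):

```python
CRITERIA = ["relevance", "clarity", "completeness", "usability"]

def average_score(runs: list[dict]) -> float:
    """Average all criterion scores (1-5) across repeated runs of one variant."""
    scores = [run[c] for run in runs for c in CRITERIA]
    return round(sum(scores) / len(scores), 2)

variant_a_runs = [{"relevance": 3, "clarity": 4, "completeness": 2, "usability": 3}]
variant_b_runs = [{"relevance": 5, "clarity": 4, "completeness": 4, "usability": 5}]

print(average_score(variant_a_runs))  # 3.0
print(average_score(variant_b_runs))  # 4.5
```

Scoring each run right after you read it, rather than from memory later, keeps the comparison honest.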
Quick Comparison Script
import anthropic
client = anthropic.Anthropic()
def test_variant(variant_name, prompt):
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
variant_a = "Write a poem about mountains"
variant_b = "Write a 12-line poem about mountain peaks using vivid imagery"
output_a = test_variant("A", variant_a)
output_b = test_variant("B", variant_b)
print("VARIANT A:")
print(output_a)
print("\nVARIANT B:")
print(output_b)
Key Takeaway
Prompt failures follow predictable patterns: misunderstanding, verbosity, hallucination, off-topic, wrong format. Use a systematic iteration loop: test → analyze → adjust → retest. Log your attempts to avoid regression and understand what works. Parameters like temperature and top_p fine-tune behavior. Even informal A/B testing helps you choose between prompt variants.
Exercise: Take a Failing Prompt, Diagnose and Fix It
In this exercise, you’ll practice the full debugging workflow.
The Failing Prompt
Prompt: "List reasons why people procrastinate"
Output:
"People procrastinate for various psychological and external reasons.
Some avoid tasks they find unpleasant or anxiety-inducing. Others struggle
with time management or perfectionism. External factors like distractions,
lack of resources, or unclear deadlines can also contribute. Additionally,
low self-discipline and difficulty prioritizing tasks are common factors..."
Your Task
1. Analyze the failure
   - What’s wrong with this output?
   - Is it a clarity issue? Format issue? Tone issue?
   - What failure mode is this? (From the lesson)
2. Identify the root cause
   - Why did the prompt produce this output?
   - What did you NOT specify in the original prompt?
3. Create improved versions
   - Write Prompt v2 with targeted fixes
   - Write Prompt v3 with additional improvements
   - For each version, explain what you changed and why
4. Test your prompts (optional but recommended)
   - Run your improved prompts on an actual LLM
   - Compare the outputs
   - Which version is best? Why?
5. Document your process
   - Write a short summary (200-300 words) covering:
     - The original failure and why it happened
     - Your diagnostic process
     - The fixes you applied
     - How the final version is better
Example Structure for Your Answer
## Original Analysis
Failure: Generic, over-explained response
Root cause: No constraints on format, depth, or audience
## Prompt v2 (Targeted Fix)
[New prompt + explanation of change]
Result: [What improved?]
## Prompt v3 (Further Refinement)
[Another iteration]
Result: [What improved further?]
## Final Assessment
The most effective version was: [v2/v3]
Because: [explanation]
Bonus Challenge
Take the original output and show how you would edit it into what the improved prompt would generate. This helps you understand the gap between vague and specific prompts.