Debugging and Iterating on Prompts
Even with all the techniques you’ve learned, prompts sometimes fail. The model might misunderstand, go off-topic, be too verbose, or miss nuance. When this happens, you need a systematic way to diagnose and fix the problem. This is where debugging and iteration come in. Think of it like testing software: you observe failures, form hypotheses about the cause, adjust, and test again.
When Prompts Fail: Common Failure Modes
Understanding why a prompt fails is the first step to fixing it. Here are the most common failure modes:
Failure Mode 1: The Model Misunderstands the Task
The model produces something syntactically correct but fundamentally wrong.
Symptom:
You ask: "Summarize this article about climate change"
Model produces: A personal essay about why climate is important
Expected: A factual summary of the article's main points
Root causes:
- “Summarize” is ambiguous (summary of what? The main points? Key arguments?)
- Insufficient context about what you want extracted
- The model defaults to a generic interpretation
How to fix:
- Be more specific about what “summary” means for YOUR task
- Provide an example of a good summary
- Add constraints (format, length, focus areas)
Better prompt:
Extract the 3 main arguments from this article about climate change.
Format as a numbered list, 1-2 sentences per argument.
Focus on arguments the author explicitly makes, not your own interpretation.
Article:
[...]
Failure Mode 2: Over-Explanation (Verbosity)
The model adds unnecessary context, disclaimers, or background.
Symptom:
You ask: "Is 2+2 equal to 4?"
Model responds: "In standard mathematics, when we consider the arithmetic
operation of addition, two units plus two units equal four units. This is
based on the foundational axioms of arithmetic, which have been proven..."
Expected: "Yes" or "Yes, 2+2=4"
Root causes:
- No constraint on length or format
- The model defaults to being thorough
- No example showing brevity
How to fix:
- Add explicit length constraints
- Use negative instructions (“Don’t explain basic concepts”)
- Provide a terse example
- Specify output format strictly
Better prompt:
Is 2+2 equal to 4?
Answer with only "Yes" or "No". No explanation needed.
Failure Mode 3: Hallucination (Making Things Up)
The model invents facts, statistics, or sources that don’t exist.
Symptom:
You ask: "What did the CEO say about this in their earnings call?"
Model responds: "The CEO mentioned that Q3 revenue grew 40% and they
plan to expand into 5 new markets" (but the CEO said no such thing)
Root causes:
- The model doesn’t have the specific information
- No constraint against fabrication
- The task seems to demand a confident answer
How to fix:
- Add explicit anti-hallucination instructions
- Require citations or quotes
- Use phrases like “Based on what you see in the document…”
- Ask for “unknown” if the answer isn’t in the provided text
Better prompt:
Based ONLY on the earnings call transcript provided below, what did the CEO say
about revenue growth?
If the CEO did not discuss revenue growth, respond: "Not mentioned in this call"
Do NOT make up numbers or claims not explicitly stated in the transcript.
Transcript:
[...]
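Anti-hallucination instructions can also be backed by a lightweight automated check: scan the model's answer for numbers that never appear in the source document. A minimal sketch (the function name and regex are illustrative, not from any library; a substring check like this catches fabricated figures but not fabricated claims in prose):

```python
import re

def find_ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Return numeric claims in the answer that never appear in the source."""
    numbers = re.findall(r"\d+(?:\.\d+)?%?", answer)
    return [n for n in numbers if n not in source]

transcript = "Revenue grew 12% year over year, driven by enterprise deals."
answer = "The CEO said revenue grew 40% and plans to expand into 5 new markets."

print(find_ungrounded_numbers(answer, transcript))  # ['40%', '5']
```

Flagged numbers are a signal to re-check the output against the transcript, not proof of hallucination on their own.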
Failure Mode 4: Off-Topic or Tangential
The model drifts from your core request into related but wrong territory.
Symptom:
You ask: "How do I improve my typing speed?"
Model responds with tips about ergonomic chairs, desk setup, and
maintaining good posture (technically helpful but not what you asked for)
Root causes:
- The prompt is too open-ended
- The model follows associations rather than strict instructions
- No clear boundaries around scope
How to fix:
- Narrow the scope with constraints
- Use negative instructions to exclude related topics
- Provide examples of on-topic responses
- Use XML/markdown delimiters to emphasize what you want
Better prompt:
How do I improve my typing speed? Focus specifically on:
1. Typing technique and finger placement
2. Drills and practice exercises
3. Tools or software that help practice
DO NOT include:
- Ergonomic setup (monitor height, chair, etc.)
- Health/posture advice
- General wellness tips
Format: Numbered list of 5 specific techniques
Failure Mode 5: Wrong Format or Structure
The output is correct but in the wrong format.
Symptom:
You ask: "List 3 benefits of exercise"
Model responds: "Exercise offers many benefits, including improved
cardiovascular health and stronger muscles, plus better mental health..."
Expected:
- Cardiovascular health
- Stronger muscles
- Better mental health
Root causes:
- Format constraint was vague or missing
- The model’s default format doesn’t match your need
- No example showing the desired format
How to fix:
- Specify format extremely explicitly
- Provide an example of correctly formatted output
- Use strict delimiters or templates
- Test the prompt with a simple, unambiguous example first
Better prompt:
List 3 benefits of exercise. Format as a bullet-point list.
Benefits:
- [benefit 1]
- [benefit 2]
- [benefit 3]
Example of correct format:
- Improves cardiovascular health
- Increases muscle strength
- Enhances mental well-being
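Format failures are the easiest failure mode to catch programmatically. A small validator like the sketch below (names are illustrative) can flag outputs that drift from the required bullet-list shape before they reach a user:

```python
def is_bullet_list(text: str, expected_items: int = 3) -> bool:
    """True if every non-empty line is a '- ' bullet and the count matches."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    return len(lines) == expected_items and all(ln.startswith("- ") for ln in lines)

good = "- Improves cardiovascular health\n- Increases muscle strength\n- Enhances mental well-being"
bad = "Exercise offers many benefits, including improved cardiovascular health."

print(is_bullet_list(good))  # True
print(is_bullet_list(bad))   # False
```

Checks like this pair naturally with the retest step: rerun the prompt, validate automatically, and only inspect by hand when the check fails.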
The Iterative Refinement Loop: Test → Analyze → Adjust → Retest
Prompt engineering is iterative. You rarely get it perfect on the first try. Here’s the systematic process:
Step 1: Test (Run the Prompt)
Run your prompt against the model and save the output.
Prompt v1:
"Write a blog post about remote work"
Output:
[Generic 500-word piece that could apply to any company]
Step 2: Analyze (Diagnose the Problem)
Ask yourself: What’s wrong? Is it a problem with:
- Clarity? (Does the model understand the task?)
- Format? (Is the output shaped correctly?)
- Content? (Is it the right information?)
- Tone? (Does it sound right?)
- Completeness? (Is anything missing?)
Analysis:
- Problem: Too generic, no target audience
- Root cause: Prompt didn't specify who this is for
- Failure mode: The model defaulted to generic content
- Fix strategy: Add specific audience and context
Step 3: Adjust (Modify the Prompt)
Make a targeted change based on your analysis. Change only one or two things at a time so you can isolate what works.
Prompt v2 (Adjusted):
"Write a 800-word blog post about remote work for a SaaS startup founder
who's deciding whether to go fully remote. Focus on pros and cons specific
to scaling a technical team. Keep tone conversational and practical."
Output:
[Much more targeted, but still missing specific founder pain points]
Step 4: Retest (Evaluate the Change)
Run the new prompt and compare to the previous output. Is it better? Worse? Same?
Comparison:
v1 output: Generic, no audience
v2 output: Better targeted, more specific, but missing founder perspective
Next issue: Should address founder-specific concerns (hiring, retention, etc.)
Step 5: Iterate Again
Return to step 1 with your new understanding.
Prompt v3 (Further adjusted):
"Write an 800-word blog post about remote work for a SaaS founder who's
building a technical team. Address:
1. How remote work affects hiring and retention
2. Communication and culture challenges
3. Productivity and collaboration in engineering teams
Tone: Conversational, practical, from experience of someone who's done this.
Include 1-2 specific examples of challenges you've faced or heard about.
DO NOT: Add generic pros/cons everyone already knows"
Output:
[Much more useful and specific]
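The test → analyze → adjust → retest loop can be sketched as a small harness. Here `run_model` is a stub standing in for a real API call, and `check` is whatever pass/fail test matters for your task; all names are illustrative:

```python
def run_model(prompt: str) -> str:
    # Stub: swap in a real API call when testing live.
    return f"DRAFT: {prompt}"

def iterate(prompts, check):
    """Run each prompt version in order, stopping at the first that passes."""
    history = []
    for version, prompt in enumerate(prompts, start=1):
        output = run_model(prompt)
        passed = check(output)
        history.append({"version": version, "prompt": prompt, "passed": passed})
        if passed:
            break
    return {"history": history, "final": history[-1]}

versions = [
    "Write a blog post about remote work",
    "Write a blog post about remote work for a SaaS founder",
]
result = iterate(versions, check=lambda out: "founder" in out)
print(result["final"]["version"])  # 2
```

The `history` list doubles as an iteration log, which leads directly into the logging strategies below.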
Prompt Logging and Version Control Strategies
As you iterate, keep track of your work. This prevents regressions and lets you understand what worked.
Simple Logging in a Text File
## Remote Work Blog Post
### Attempt 1 - Initial Prompt
Prompt: "Write a blog post about remote work"
Result: Generic, no clear audience
Issues: Too open-ended, defaults to median content
### Attempt 2 - Added Audience
Prompt: [full prompt v2]
Result: Better focused, but missing founder perspective
Issues: Still not specific enough to founder pain points
### Attempt 3 - Added Specific Elements
Prompt: [full prompt v3]
Result: Excellent. Specific founder perspective, practical advice
Status: ✅ WORKING
### Final Prompt (v3)
[Store the final working prompt here for reuse]
Structured Logging in JSON
For more systematic tracking:
{
  "task": "Generate blog post about remote work",
  "iterations": [
    {
      "version": 1,
      "prompt": "Write a blog post about remote work",
      "score": 2,
      "issues": ["too generic", "no audience"],
      "next_adjustment": "add specific audience"
    },
    {
      "version": 2,
      "prompt": "Write a blog post about remote work for SaaS founder...",
      "score": 6,
      "issues": ["missing founder pain points"],
      "next_adjustment": "add specific founder challenges"
    },
    {
      "version": 3,
      "prompt": "Write an 800-word blog post about remote work for a SaaS founder addressing hiring, communication, and engineering team productivity...",
      "score": 9,
      "issues": [],
      "status": "WORKING"
    }
  ]
}
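A small helper can maintain a log in this shape automatically, numbering versions as you go. A sketch (the file name and field names are assumptions, mirroring the example log above):

```python
import json
from pathlib import Path

def log_iteration(path: Path, task: str, entry: dict) -> None:
    """Append one iteration record to a JSON log, creating the file if needed."""
    if path.exists():
        log = json.loads(path.read_text())
    else:
        log = {"task": task, "iterations": []}
    entry = {"version": len(log["iterations"]) + 1, **entry}
    log["iterations"].append(entry)
    path.write_text(json.dumps(log, indent=2))

log_file = Path("prompt_log.json")
log_iteration(log_file, "Blog post about remote work",
              {"prompt": "Write a blog post about remote work",
               "score": 2, "issues": ["too generic"]})
log_iteration(log_file, "Blog post about remote work",
              {"prompt": "Write a blog post about remote work for a SaaS founder...",
               "score": 6, "issues": ["missing founder pain points"]})
```

Because the log is plain JSON on disk, it also versions cleanly in git alongside the prompts themselves.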
Version Control Strategy
If you use git (version control), store prompts as files:
prompts/
├── blog-posts/
│   └── remote-work-blog.md
│       v1: too generic
│       v2: added audience
│       v3: added founder focus (FINAL)
├── code-reviews/
│   └── python-code-review.md
└── data-analysis/
    └── financial-analysis.md
Commit changes with clear messages:
git add prompts/blog-posts/remote-work-blog.md
git commit -m "feat: improve remote work blog prompt with founder perspective"
Temperature, Top-P, and Other Parameters That Affect Output
Beyond the prompt itself, model parameters affect output. Understanding these helps you debug.
Temperature (Randomness)
Default: 1.0 (balanced randomness and consistency). Note that Anthropic's API accepts temperatures from 0.0 to 1.0; some other providers allow values up to 2.0.
import anthropic

client = anthropic.Anthropic()

# Low temperature: deterministic
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    temperature=0.0,  # Almost always picks the highest-probability tokens
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
# Output: near-identical on every run

# High temperature: creative
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    temperature=1.0,  # The maximum the Anthropic API accepts
    messages=[{"role": "user", "content": "Tell me a story about a cat"}]
)
# Output: varies noticeably between runs
When to adjust:
- Lower (0.0-0.3) for: factual tasks, code generation, when consistency is needed
- Higher (0.7-1.0) for: creative writing, brainstorming, ideation
Top-P (Nucleus Sampling)
Default: 1.0 - Consider all possible tokens
Top-P controls what percentage of the probability distribution is considered:
# top_p = 0.9: consider only the tokens that make up the top 90% of probability
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=300,
    top_p=0.9,
    messages=[{"role": "user", "content": "Write a poem about spring"}]
)
# More focused, less random
Interaction with temperature:
- Temperature controls how strictly sampling follows the distribution
- Top-P controls which tokens remain in the distribution
- Using both: the distribution is first truncated with top-p, then temperature is applied to the surviving tokens (in practice, adjust one at a time so you can tell which change mattered)
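The truncation step is easy to see with a toy next-token distribution. This sketch (made-up probabilities, illustrative only) keeps the smallest set of top tokens whose cumulative probability reaches top_p:

```python
# Made-up next-token probabilities (sum to 1.0)
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.04, "xylophone": 0.01}

def nucleus(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of highest-probability tokens covering top_p mass."""
    kept, mass = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    return kept

print(nucleus(probs, 0.9))  # {'the': 0.5, 'a': 0.3, 'cat': 0.15}
```

With top_p = 0.9, the low-probability tails ("zebra", "xylophone") are excluded before any sampling happens, which is why lowering top_p reduces odd, low-probability outputs.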
Max Tokens
Limits the response length:
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,  # Response hard-capped at 100 tokens
    messages=[{"role": "user", "content": "Write a novel"}]
)
# Note: this truncates output (possibly mid-sentence); it does not make the
# model write more concisely. Pair it with a length constraint in the prompt.
Troubleshooting with Parameters
| Problem | Parameter | Adjustment |
|---|---|---|
| Output is random/inconsistent | temperature | Lower to 0.0-0.3 |
| Output is too generic/repetitive | temperature | Raise to 0.7-1.0 |
| Getting low-probability wrong answers | top_p | Lower to 0.5-0.8 |
| Response is too long | max_tokens | Set explicit limit |
| Model rambles or goes off-topic | max_tokens + temperature | Lower both |
A/B Testing Prompts Informally
Without extensive infrastructure, you can still compare prompts:
Simple A/B Test
Task: Generate email subject lines for a newsletter
VARIANT A:
"Generate 5 email subject lines for a tech newsletter"
VARIANT B:
"Generate 5 email subject lines for a tech newsletter.
Goal: High open rate (30%+)
Audience: Software engineers who value learning
Style: Specific insight or tip, not clickbait
Format: Numbered list"
Run both prompts 3 times each, note which variant produces better results.
Scoring Framework
Score outputs on criteria that matter:
Criteria: (1=bad, 5=excellent)
- Relevance to task: ___
- Clarity: ___
- Completeness: ___
- Usability: ___
VARIANT A average score: ___
VARIANT B average score: ___
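Averaging those per-criterion scores across runs takes only a few lines of code. A sketch with made-up ratings (criterion names mirror the framework above):

```python
CRITERIA = ["relevance", "clarity", "completeness", "usability"]

def average_score(runs: list[dict]) -> float:
    """Average all criterion scores (1-5) across repeated runs of one variant."""
    scores = [run[c] for run in runs for c in CRITERIA]
    return round(sum(scores) / len(scores), 2)

variant_a_runs = [{"relevance": 3, "clarity": 4, "completeness": 2, "usability": 3}]
variant_b_runs = [{"relevance": 5, "clarity": 4, "completeness": 4, "usability": 5}]

print(average_score(variant_a_runs))  # 3.0
print(average_score(variant_b_runs))  # 4.5
```

Scoring each run right after you read it, rather than from memory later, keeps the comparison honest.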
Quick Comparison Script
import anthropic
client = anthropic.Anthropic()
def test_variant(variant_name, prompt):
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
variant_a = "Write a poem about mountains"
variant_b = "Write a 12-line poem about mountain peaks using vivid imagery"
output_a = test_variant("A", variant_a)
output_b = test_variant("B", variant_b)
print("VARIANT A:")
print(output_a)
print("\nVARIANT B:")
print(output_b)
Key Takeaway
Prompt failures follow predictable patterns: misunderstanding, verbosity, hallucination, off-topic, wrong format. Use a systematic iteration loop: test → analyze → adjust → retest. Log your attempts to avoid regression and understand what works. Parameters like temperature and top_p fine-tune behavior. Even informal A/B testing helps you choose between prompt variants.
Exercise: Take a Failing Prompt, Diagnose and Fix It
In this exercise, you’ll practice the full debugging workflow.
The Failing Prompt
Prompt: "List reasons why people procrastinate"
Output:
"People procrastinate for various psychological and external reasons.
Some avoid tasks they find unpleasant or anxiety-inducing. Others struggle
with time management or perfectionism. External factors like distractions,
lack of resources, or unclear deadlines can also contribute. Additionally,
low self-discipline and difficulty prioritizing tasks are common factors..."
Your Task
1. Analyze the failure
   - What’s wrong with this output?
   - Is it a clarity issue? Format issue? Tone issue?
   - What failure mode is this? (From the lesson)
2. Identify the root cause
   - Why did the prompt produce this output?
   - What did you NOT specify in the original prompt?
3. Create improved versions
   - Write Prompt v2 with targeted fixes
   - Write Prompt v3 with additional improvements
   - For each version, explain what you changed and why
4. Test your prompts (optional but recommended)
   - Run your improved prompts on an actual LLM
   - Compare the outputs
   - Which version is best? Why?
5. Document your process
   - Write a short summary (200-300 words) covering:
     - The original failure and why it happened
     - Your diagnostic process
     - The fixes you applied
     - How the final version is better
Example Structure for Your Answer
## Original Analysis
Failure: Generic, over-explained response
Root cause: No constraints on format, depth, or audience
## Prompt v2 (Targeted Fix)
[New prompt + explanation of change]
Result: [What improved?]
## Prompt v3 (Further Refinement)
[Another iteration]
Result: [What improved further?]
## Final Assessment
The most effective version was: [v2/v3]
Because: [explanation]
Bonus Challenge
Take the original output and show how you would edit it into what the improved prompt would generate. This helps you understand the gap between vague and specific prompts.