Vision Prompting — Working with Images
Introduction
Until recently, AI could only read text. Now, models like GPT-4 Vision, Claude 3, and Gemini can see and reason about images. This opens entirely new applications: analyzing documents, understanding diagrams, comparing images, extracting data from screenshots.
But vision prompting is different from text prompting. Showing an image conveys more than any description of it: an image contains details that language misses or describes poorly. This lesson teaches you how to write prompts that effectively guide vision models to interpret images correctly.
Key Takeaway: When working with images, precision about what you want the model to focus on is crucial. A vague “analyze this image” and a specific “extract the table from this image and return it as JSON” may look like similar requests, but they produce very different output.
How Vision Models Process Images
Vision models work in stages:
1. Image Input → Model sees the actual image pixels
2. Visual Encoding → Model builds internal representation
3. Understanding → Model interprets what it sees
4. Text Generation → Model outputs text about what it understands
You influence stages 3 and 4 with your prompt:
import anthropic

client = anthropic.Anthropic()

def call_vision_model(image, prompt):
    """Call a vision model with an image"""
    # The model gets BOTH the image and your prompt
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image  # Base64-encoded image data
                        }
                    },
                    {
                        "type": "text",
                        "text": prompt  # Your text prompt
                    }
                ]
            }
        ]
    )
    return response.content[0].text
The key insight: the model can see the image, so your prompt doesn’t need to describe what’s in it. Instead, your prompt tells the model what to do with what it sees.
Effective Vision Prompts
Bad Vision Prompts
❌ Too vague:
"What do you see?"
❌ Redundant (asking model to describe what's obvious):
"I see a receipt. Extract the receipt data."
(The model can already see it's a receipt)
❌ Over-directive:
"This is a receipt for a store. The date is in the top left.
The total is at the bottom. Extract the total."
(If the model can see it, you don't need to describe layout)
Good Vision Prompts
✓ Task-focused:
"Extract all prices from this receipt as a JSON array.
Return: {"items": [{"name": "...", "price": 0.00}]}"
✓ Focuses on what you need, not what you see:
"Is this receipt from a grocery store or restaurant?
How can you tell?"
✓ Provides context for ambiguous cases:
"Extract the product code. If multiple codes are visible,
return the one on the barcode (not SKU or batch number)."
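The pattern behind the good examples can be captured in a small helper that assembles a task-focused prompt from three parts: the task itself, the expected output format, and a rule for ambiguous cases. The function name and structure here are illustrative, not part of any model API:

```python
def build_vision_prompt(task: str, output_format: str = None,
                        ambiguity_rule: str = None) -> str:
    """Assemble a task-focused vision prompt from its parts."""
    parts = [task]
    if output_format:
        parts.append(f"Return ONLY this format:\n{output_format}")
    if ambiguity_rule:
        parts.append(ambiguity_rule)
    return "\n\n".join(parts)

prompt = build_vision_prompt(
    task="Extract all prices from this receipt as a JSON array.",
    output_format='{"items": [{"name": "...", "price": 0.00}]}',
    ambiguity_rule="If a price is unreadable, omit that item."
)
```

Keeping the three parts separate makes it easy to reuse one output format across many tasks while varying the ambiguity rule per image source.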
Common Vision Tasks
Task 1: Document Analysis
Extract structured data from documents (receipts, invoices, forms):
def extract_receipt_data(image_path: str) -> dict:
    """Extract receipt data as structured JSON"""
    prompt = """Extract the following information from this receipt:
- Store name
- Transaction date
- Item names and prices (as array)
- Subtotal
- Tax amount
- Total amount
- Payment method

Return ONLY valid JSON in this format:
{
    "store_name": "...",
    "date": "YYYY-MM-DD",
    "items": [
        {"name": "...", "price": 0.00}
    ],
    "subtotal": 0.00,
    "tax": 0.00,
    "total": 0.00,
    "payment_method": "..."
}

If any field is not visible, omit it from the JSON.
If you can't determine a field confidently, omit it."""
    return vision_model(image_path, prompt)
# Similarly for invoices:
def extract_invoice_data(image_path: str) -> dict:
    """Extract invoice information"""
    prompt = """Extract invoice data as JSON:
{
    "invoice_number": "...",
    "invoice_date": "YYYY-MM-DD",
    "customer_name": "...",
    "vendor_name": "...",
    "line_items": [
        {"description": "...", "quantity": 0, "unit_price": 0.00, "total": 0.00}
    ],
    "subtotal": 0.00,
    "tax": 0.00,
    "total": 0.00
}"""
    return vision_model(image_path, prompt)
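Both extractors return whatever text the model produces, so in practice you need a parsing step before treating the result as structured data. A minimal sketch; the fence-stripping heuristic is an assumption about how models sometimes wrap JSON in a markdown code block:

```python
import json

def parse_model_json(raw: str):
    """Parse a model's text response as JSON, returning None on failure."""
    text = raw.strip()
    # Models sometimes wrap JSON in a markdown code fence; strip it if present.
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Returning None instead of raising lets calling code decide whether to retry the model call or flag the image for manual review.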
Task 2: OCR and Text Extraction
Extract text from images where the text is the main goal:
def extract_text_from_image(image_path: str, context: str = None) -> str:
    """Extract readable text from an image"""
    base_prompt = "Extract all readable text from this image. Preserve formatting (lists, paragraphs) if possible."
    if context:
        prompt = f"{base_prompt}\n\nContext: {context}"
    else:
        prompt = base_prompt
    return vision_model(image_path, prompt)
# For handwritten text:
def extract_handwriting(image_path: str) -> str:
    """Extract handwritten text"""
    prompt = """Extract the handwritten text from this image.
- Do your best with unclear letters
- If a word is illegible, use [illegible]
- Preserve line breaks and spacing
- Note any corrections or cross-outs

Return the extracted text."""
    return vision_model(image_path, prompt)
# For table extraction:
def extract_table_as_csv(image_path: str) -> str:
    """Extract table from image as CSV"""
    prompt = """Extract the table from this image and return it as CSV format.
- First row should be headers
- Use commas to separate columns
- Escape quotes in cell values as \"
- Preserve the table structure exactly

Example output:
Name,Age,Email
John Smith,35,john@example.com
Jane Doe,28,jane@example.com"""
    return vision_model(image_path, prompt)
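Because the CSV comes back as free text, it is worth checking that every row has the same number of columns before handing it downstream. A sketch using the standard-library csv module; the validation rules (header row plus at least one data row, consistent width) are one reasonable choice, not the only one:

```python
import csv
import io

def validate_csv(text: str) -> bool:
    """Check that model-produced CSV has a header and consistent column counts."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if len(rows) < 2:  # need a header plus at least one data row
        return False
    width = len(rows[0])
    # Every data row must match the header's column count
    return all(len(row) == width for row in rows[1:])
```

If validation fails, a common fallback is to re-prompt the model with the failure reason appended (e.g. "Row 3 had 2 columns but the header has 3").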
Task 3: Image Comparison
Compare multiple images:
def compare_images(image1: str, image2: str, prompt: str = None) -> dict:
    """Compare two images (arguments are base64-encoded image data)"""
    if prompt is None:
        prompt = """Compare these two images:
1. What are the main differences?
2. What are the main similarities?
3. Are they from the same place/time/event?
4. What changed between the two?

Return structured analysis."""
    # Note: vision models can handle multiple images in one call
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}
                },
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}
                },
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }
    ]
    # Assumes a vision_model variant that accepts a prebuilt messages list
    return vision_model(messages)

# For document version comparison:
def compare_document_versions(old_version: str, new_version: str) -> dict:
    """Find differences between two versions of a document image"""
    prompt = """These are two versions of the same document.
What changed between them?

Return:
- Modified fields (show old and new values)
- Added sections
- Removed sections
- Formatting changes

Focus on substantive changes, ignore minor formatting."""
    return compare_images(old_version, new_version, prompt)
Task 4: Chart and Graph Analysis
Understand visual data:
def analyze_chart(image_path: str, question: str = None) -> dict:
    """Analyze charts, graphs, or diagrams"""
    base_prompt = """Analyze this chart/graph:
1. What type of chart is it? (bar, line, pie, etc.)
2. What is being measured?
3. What is the trend or key insight?
4. What are the main data points?
5. What's the most important takeaway?"""
    if question:
        prompt = f"{base_prompt}\n\nAlso answer this specific question: {question}"
    else:
        prompt = base_prompt
    return vision_model(image_path, prompt)
# For technical diagrams:
def analyze_diagram(image_path: str) -> dict:
    """Analyze technical, architectural, or conceptual diagrams"""
    prompt = """Analyze this diagram:
1. What is this diagram showing? (architecture, process, relationship, etc.)
2. What are the main components?
3. How do components relate to each other?
4. What would you improve in this diagram?
5. Describe the flow or sequence if applicable.

Return structured analysis."""
    return vision_model(image_path, prompt)
OCR Limitations and Tips
Vision models aren’t perfect at text extraction. Here’s how to work with them:
def extract_text_with_quality_check(image_path: str) -> dict:
    """Extract text and flag confidence issues"""
    prompt = """Extract all text from this image. For each piece of text:
1. Identify what it is (label, heading, body text, etc.)
2. Report confidence (high/medium/low)
3. If confidence is low, explain why

Return as JSON:
{
    "extracted_text": [
        {
            "text": "...",
            "type": "heading|label|body",
            "confidence": "high|medium|low",
            "reason_if_low": "..."
        }
    ],
    "overall_confidence": "high|medium|low",
    "notes": "Any issues with extraction..."
}"""
    return vision_model(image_path, prompt)
# For printed text at poor angles:
def extract_rotated_text(image_path: str) -> str:
    """Extract text from rotated or angled images"""
    prompt = """Extract all text from this image. The image may be rotated
or at an angle. Do your best to:
1. Identify the correct reading direction
2. Extract text in proper reading order
3. Preserve paragraph structure

If the image is upside-down or severely rotated, still extract
the text (reading it as it should be read, not how it appears)."""
    return vision_model(image_path, prompt)
# For low-quality or degraded text:
def extract_degraded_text(image_path: str) -> dict:
    """Extract text from low-quality images"""
    prompt = """This image has low quality, fading, or degradation.
Extract what you can:
1. Attempt to read each word or character
2. Mark illegible portions as [ILLEGIBLE]
3. Flag your confidence (0-100%) at the top
4. Note what made extraction difficult

Example:
Confidence: 65%
Extracted: "The quick [ILLEGIBLE] fox jumps"
Difficulties: Text is faded, parts cut off"""
    result = vision_model(image_path, prompt)
    return {
        'text': result,
        'high_quality': False,
        'recommendation': 'Manual review recommended'
    }
Working with Screenshots and UI Analysis
Vision models excel at understanding user interfaces:
def analyze_screenshot(screenshot_path: str, task: str) -> dict:
    """Analyze a screenshot for UI testing or understanding"""
    prompts = {
        'find_element': "What is the location and description of the [element]?",
        'understand_state': "What is the current state of this UI? What can the user do next?",
        'find_bug': "Does this UI look correct? Are there any obvious bugs or errors?",
        'extract_content': "What is the main content shown in this screenshot?",
        'navigation': "How would you navigate to [destination] from this screen?"
    }
    prompt = prompts.get(task, "Analyze this screenshot")
    return vision_model(screenshot_path, prompt)
# For automated testing:
def validate_ui_against_design(screenshot: str, design_mockup: str) -> dict:
    """Compare actual UI to design mockup (arguments are base64-encoded)"""
    prompt = """Compare the actual screenshot to the design mockup:
1. Does the layout match the design?
2. Are colors correct?
3. Is text/content accurate?
4. Are all elements visible?
5. Are there any regressions?

Return findings as:
{
    "matches_design": true/false,
    "issues": [
        {"type": "layout|color|content|missing", "description": "..."}
    ],
    "critical": [list of critical issues]
}"""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": screenshot}},
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": design_mockup}},
                {"type": "text", "text": prompt}
            ]
        }
    ]
    return vision_model(messages)
Best Practices for Vision Prompting
VISION PROMPT BEST PRACTICES:
1. BE SPECIFIC ABOUT WHAT YOU WANT
❌ "Analyze this image"
✓ "Extract the product SKU from this image"
2. PROVIDE CLEAR OUTPUT FORMAT
❌ "Tell me what you see"
✓ "Return JSON with keys: {sku, price, barcode}"
3. INCLUDE CONTEXT WHEN NEEDED
❌ "Is this a cat?"
✓ "Is this a house cat or a wild cat? How can you tell?"
4. HANDLE AMBIGUITY EXPLICITLY
❌ "What is this?"
✓ "If multiple values are visible, extract the largest one"
5. TEST WITH REAL IMAGES
Vision models see different things than text descriptions.
Always test with actual images, not descriptions.
6. ACCOUNT FOR IMAGE QUALITY
If you expect poor-quality images, say so:
"This image may be blurry. Do your best to..."
7. HANDLE EDGE CASES
"If the requested information isn't visible,
return null rather than guessing"
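Several of these practices combine into one defensive wrapper: a specific prompt, an explicit null for missing information, and a retry on malformed output. A sketch, assuming a `vision_model(image, prompt)` callable like the one used throughout this lesson (passed in here as a parameter so the helper is testable):

```python
import json

def extract_field(image, field: str, vision_model, retries: int = 1):
    """Extract one field, returning None instead of a guess on failure."""
    # Practices 1, 2, and 7: specific task, explicit format, null for missing
    prompt = (f'Extract the {field} from this image. '
              f'Return ONLY JSON: {{"{field}": "..."}}. '
              f'If the {field} is not visible, return {{"{field}": null}}.')
    for _ in range(retries + 1):
        raw = vision_model(image, prompt)
        try:
            return json.loads(raw).get(field)
        except (json.JSONDecodeError, AttributeError):
            continue  # malformed output: retry, then give up
    return None
```

The caller can then treat None uniformly, whether the field was absent from the image or the model's output never parsed.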
Exercise: Build Image Extraction Pipeline
Create a complete image processing system that:

1. Defines 3 different extraction types:
   - Document extraction (receipts/invoices)
   - Table extraction
   - Chart analysis

2. Includes, for each type:
   - A vision prompt (optimized for clarity)
   - Validation logic (check output is valid JSON/format)
   - Error handling (graceful degradation)

3. Is tested with at least 2 images per type:
   - One high-quality image
   - One challenging/low-quality image

4. Produces a report showing:
   - Original image
   - Extracted data
   - Confidence assessment
   - Errors/limitations encountered
Deliverables:
- 3 vision prompts (400-600 words total)
- Python code for extraction
- Validation logic for each output format
- Test images (or descriptions)
- Results and analysis
- Notes on what worked and what didn’t
Summary
In this lesson, you’ve learned:
- How vision models process images and text together
- The difference between describing images and directing vision models
- Best practices for vision prompts across different task types
- How to handle OCR, document extraction, and image comparison
- UI analysis and testing with screenshots
- Error handling for poor-quality or ambiguous images
Next, you’ll learn about audio, video, and document prompting.