Intermediate

Structured Output from Multimodal Inputs

Lesson 3 of 4 · Estimated time: 50 min

Introduction

Extracting data from images, documents, or audio is only useful if you get structured, predictable output. A prompt that sometimes returns JSON and sometimes returns natural language is worthless in production. This lesson teaches you how to reliably extract structured data from multimodal inputs and validate it.

The key challenge: models are creative. They’ll interpret your instructions in ways you didn’t expect. You need to be explicit, provide examples, and validate ruthlessly.

Key Takeaway: Structured output requires clear schemas, exact format specifications, and validation. Never assume the model will format data the way you expect; always validate and provide fallback handling.

Schema-Guided Extraction

Define exactly what you want before extracting:

from dataclasses import dataclass
from typing import List, Optional
import json

@dataclass
class ExtractedReceipt:
    """Schema for receipt extraction"""
    store_name: str
    transaction_date: str
    items: List[dict]  # {"name": str, "price": float}
    subtotal: float
    tax: float
    total: float
    payment_method: Optional[str] = None

    def to_json(self) -> str:
        """Convert to JSON"""
        return json.dumps({
            'store_name': self.store_name,
            'transaction_date': self.transaction_date,
            'items': self.items,
            'subtotal': self.subtotal,
            'tax': self.tax,
            'total': self.total,
            'payment_method': self.payment_method
        })


def create_extraction_prompt(schema: type, example: Optional[dict] = None) -> str:
    """Create a prompt for structured extraction"""

    # getattr fallback handles typing generics like List[dict] and
    # Optional[str], which don't reliably expose __name__
    schema_str = json.dumps({
        field: getattr(type_, '__name__', str(type_))
        for field, type_ in schema.__annotations__.items()
    }, indent=2)

    example_str = ""
    if example:
        example_str = f"""
EXAMPLE OUTPUT:
{json.dumps(example, indent=2)}"""

    return f"""Extract information from this image and return valid JSON.

SCHEMA (Required fields):
{schema_str}

RULES:
1. Return ONLY valid JSON, no other text
2. Use null for missing required fields
3. Omit optional fields if not present
4. Use proper data types (strings quoted, numbers not quoted)
5. Arrays should be JSON arrays
{example_str}

EXTRACTION RESULT:
"""


def extract_with_schema(image_path: str, schema: type,
                        example: Optional[dict] = None, model_fn=None) -> dict:
    """Extract data from image using schema guidance.

    model_fn(image_path, prompt) must call your multimodal model and
    return its text response.
    """

    prompt = create_extraction_prompt(schema, example)

    response = model_fn(image_path, prompt)

    # Try to parse as JSON
    try:
        extracted = json.loads(response)
        return {
            'success': True,
            'data': extracted,
            'raw_response': response
        }
    except json.JSONDecodeError:
        return {
            'success': False,
            'error': 'Invalid JSON returned',
            'raw_response': response
        }
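In practice, models often wrap their JSON in markdown code fences even when told not to, and a bare `json.loads` then fails on otherwise-good output. A small pre-parser that strips fences first noticeably improves parse rates. This is a sketch, not part of any particular SDK; the function name is ours:

```python
import json
import re

def parse_json_response(response: str) -> dict:
    """Parse a model response, tolerating markdown code fences around the JSON."""
    text = response.strip()
    # Strip a ```json ... ``` (or bare ``` ... ```) wrapper if present
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```

You can call this in place of the bare `json.loads` above; a `json.JSONDecodeError` then still signals a genuinely malformed response rather than a formatting quirk.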

Validation and Post-Processing

Always validate extracted data:

import jsonschema

class ExtractionValidator:
    """Validate extracted data against schema"""

    def __init__(self, schema_dict: dict):
        self.schema = schema_dict

    def validate(self, data: dict) -> dict:
        """Validate extracted data"""

        try:
            jsonschema.validate(instance=data, schema=self.schema)
            return {'valid': True, 'errors': []}
        except jsonschema.ValidationError as e:
            return {
                'valid': False,
                'errors': [str(e)],
                'failed_validation': e.message
            }

    def fix_common_errors(self, data: dict) -> dict:
        """Attempt to fix common extraction errors"""

        fixed = data.copy()

        # Fix string numbers to actual numbers
        for field, value in fixed.items():
            if isinstance(value, str) and value.replace('.', '', 1).isdigit():
                try:
                    fixed[field] = float(value) if '.' in value else int(value)
                except ValueError:
                    pass

        # Fix date formats
        if 'date' in fixed and isinstance(fixed['date'], str):
            fixed['date'] = self._normalize_date(fixed['date'])

        return fixed

    def _normalize_date(self, date_str: str) -> str:
        """Convert various date formats to YYYY-MM-DD"""

        from datetime import datetime

        formats = [
            '%m/%d/%Y',
            '%m/%d/%y',
            '%d/%m/%Y',
            '%Y-%m-%d',
            '%B %d, %Y',
            '%b %d, %Y'
        ]

        for fmt in formats:
            try:
                parsed = datetime.strptime(date_str, fmt)
                return parsed.strftime('%Y-%m-%d')
            except ValueError:
                continue

        return date_str  # Return original if can't parse


# Define schema for validation
receipt_schema = {
    "type": "object",
    "properties": {
        "store_name": {"type": "string"},
        "transaction_date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"}
                }
            }
        },
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"}
    },
    "required": ["store_name", "total"]
}

validator = ExtractionValidator(receipt_schema)

# Usage (image_path and model_fn come from your application)
extracted = extract_with_schema(image_path, ExtractedReceipt, model_fn=model_fn)

if extracted['success']:
    validation = validator.validate(extracted['data'])

    if not validation['valid']:
        print(f"Validation errors: {validation['errors']}")

        # Try to fix
        fixed_data = validator.fix_common_errors(extracted['data'])
        validation = validator.validate(fixed_data)

        if validation['valid']:
            extracted['data'] = fixed_data
            extracted['auto_fixed'] = True
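The date normalizer above is worth spot-checking on its own, because the order of the format list silently decides ambiguous dates. A standalone sketch of the same logic (the `normalize_date` name is ours, for testing outside the class):

```python
from datetime import datetime

def normalize_date(date_str: str) -> str:
    """Standalone version of _normalize_date above: convert to YYYY-MM-DD."""
    formats = ['%m/%d/%Y', '%m/%d/%y', '%d/%m/%Y',
               '%Y-%m-%d', '%B %d, %Y', '%b %d, %Y']
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return date_str  # unparseable: return unchanged
```

Note that "03/04/2024" resolves to March 4 rather than April 3 because '%m/%d/%Y' is tried before '%d/%m/%Y'; reorder the list for day-first locales.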

Handling Extraction Ambiguity

Real-world data is messy. You need strategies for ambiguity:

def extract_with_confidence(image_path: str, prompt: str,
                           model_fn=None) -> dict:
    """Extract data and report confidence"""

    # Ask the model to include confidence scores
    confidence_prompt = f"""{prompt}

For each extracted field, also include a confidence score (0-100):
{{
  "data": {{ ... extracted data ... }},
  "confidence": {{
    "field_name": score,
    ...
  }},
  "ambiguities": ["Any unclear items?"]
}}"""

    response = model_fn(image_path, confidence_prompt)

    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return {'error': 'Failed to parse response'}
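Once you have per-field confidence scores in that shape, you can route low-confidence fields to human review instead of trusting them blindly. A minimal sketch, assuming the 0-100 scale and the `data`/`confidence` keys from the prompt above (the function name and threshold are ours):

```python
def filter_by_confidence(result: dict, threshold: int = 80) -> dict:
    """Split extracted fields into accepted vs. needs-review buckets."""
    data = result.get('data', {})
    confidence = result.get('confidence', {})
    accepted, needs_review = {}, {}
    for field, value in data.items():
        # Treat a missing confidence score as zero, i.e. always review
        if confidence.get(field, 0) >= threshold:
            accepted[field] = value
        else:
            needs_review[field] = value
    return {'accepted': accepted, 'needs_review': needs_review}
```

Fields in `needs_review` can be surfaced in a human-in-the-loop UI alongside the `ambiguities` list the model returns.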


def extract_with_alternatives(image_path: str, prompt: str,
                             model_fn=None) -> dict:
    """Extract data including alternative interpretations"""

    alt_prompt = f"""{prompt}

If there's any ambiguity, provide alternative interpretations:
{{
  "primary": {{ ... most likely interpretation ... }},
  "alternatives": [
    {{
      "interpretation": {{ ... alt 1 ... }},
      "likelihood": "possible|unlikely",
      "reason": "Why this interpretation?"
    }}
  ],
  "ambiguities_noted": ["What made this ambiguous?"]
}}"""

    response = model_fn(image_path, alt_prompt)

    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return {'error': 'Failed to parse'}

Multi-Step Extraction Pipelines

For complex extraction, break it into steps:

class ExtractionPipeline:
    """Multi-step structured extraction pipeline"""

    def __init__(self):
        self.steps = []

    def add_step(self, name: str, prompt_fn, validation_fn=None):
        """Add extraction step"""
        self.steps.append({
            'name': name,
            'prompt_fn': prompt_fn,
            'validation_fn': validation_fn,
            'result': None
        })

    def run(self, input_data, model_fn) -> dict:
        """Run all steps in sequence"""

        results = []

        for step in self.steps:
            # Generate prompt
            prompt = step['prompt_fn'](input_data)

            # Extract
            extraction = model_fn(input_data, prompt)

            try:
                result = json.loads(extraction)
            except json.JSONDecodeError:
                result = {'error': f"Failed to parse step {step['name']}"}

            # Validate if validator provided
            if step['validation_fn'] and 'error' not in result:
                validation = step['validation_fn'](result)
                result['validation'] = validation

            step['result'] = result
            results.append({
                'step': step['name'],
                'result': result
            })

        return {
            'steps': results,
            'final_result': self._synthesize_results(results)
        }

    def _synthesize_results(self, step_results: list) -> dict:
        """Combine results from all steps"""

        # For receipt example: Combine items, totals, etc.
        combined = {}

        for step_result in step_results:
            if 'error' not in step_result['result']:
                combined.update(step_result['result'])

        return combined


# Example: Multi-step receipt extraction
def extract_receipt_multi_step(image_path: str, model_fn) -> dict:
    """Extract receipt using multi-step pipeline"""

    pipeline = ExtractionPipeline()

    # Step 1: Extract store info
    pipeline.add_step(
        name="Extract Store Information",
        prompt_fn=lambda _: """Extract store information:
{"store_name": "...", "address": "...", "phone": "..."}""",
        validation_fn=lambda x: 'store_name' in x
    )

    # Step 2: Extract items and prices
    pipeline.add_step(
        name="Extract Items",
        prompt_fn=lambda _: """Extract all items and prices as JSON array:
{"items": [{"name": "...", "price": 0.00}]}""",
        validation_fn=lambda x: isinstance(x.get('items'), list)
    )

    # Step 3: Extract totals
    pipeline.add_step(
        name="Extract Totals",
        prompt_fn=lambda _: """Extract financial totals:
{"subtotal": 0.00, "tax": 0.00, "total": 0.00}""",
        validation_fn=lambda x: 'total' in x
    )

    # Step 4: Extract metadata
    pipeline.add_step(
        name="Extract Metadata",
        prompt_fn=lambda _: """Extract receipt metadata:
{"date": "YYYY-MM-DD", "time": "HH:MM", "payment_method": "..."}""",
        validation_fn=lambda x: True  # Optional
    )

    return pipeline.run(image_path, model_fn)

Building Reliable Extraction Systems

class RobustExtractor:
    """Reliable extraction with retries and fallbacks"""

    def __init__(self, model_fn, max_retries: int = 3):
        self.model_fn = model_fn
        self.max_retries = max_retries

    def extract(self, input_data, prompt: str, schema: dict) -> dict:
        """Extract with retry logic"""

        for attempt in range(self.max_retries):
            try:
                # Try extraction
                response = self.model_fn(input_data, prompt)
                extracted = json.loads(response)

                # Validate
                validation = self._validate(extracted, schema)

                if validation['valid']:
                    return {
                        'success': True,
                        'data': extracted,
                        'attempts': attempt + 1
                    }

                # If invalid, try again with correction prompt
                if attempt < self.max_retries - 1:
                    prompt = self._create_correction_prompt(
                        prompt, validation['errors'], response
                    )

            except json.JSONDecodeError:
                if attempt < self.max_retries - 1:
                    # Add instruction to return valid JSON
                    prompt += "\n\nIMPORTANT: Return ONLY valid JSON"

        # Final attempt: Ask for minimal valid response
        try:
            response = self.model_fn(
                input_data,
                prompt + "\n\nReturn empty {} if unable to extract."
            )
            extracted = json.loads(response)

            return {
                'success': len(extracted) > 0,
                'data': extracted,
                'attempts': self.max_retries,
                'partial': True
            }
        except Exception:
            return {
                'success': False,
                'data': None,
                'attempts': self.max_retries,
                'error': 'Could not extract data'
            }

    def _validate(self, data: dict, schema: dict) -> dict:
        """Validate data against schema"""
        try:
            jsonschema.validate(instance=data, schema=schema)
            return {'valid': True, 'errors': []}
        except jsonschema.ValidationError as e:
            return {'valid': False, 'errors': [str(e)]}

    def _create_correction_prompt(self, original_prompt: str,
                                 errors: list, last_response: str) -> str:
        """Create prompt with correction guidance"""

        return f"""{original_prompt}

PREVIOUS ATTEMPT HAD ERRORS:
{chr(10).join(f"- {e}" for e in errors)}

PREVIOUS RESPONSE:
{last_response}

CORRECTED RESPONSE (fix the errors above):"""

Exercise: Build End-to-End Extraction System

Create a complete structured extraction system:

  1. Define schemas for 3 document types:

    • Receipt/Invoice
    • Business Card
    • Contract/Agreement
  2. For each schema:

    • Create JSON Schema definition
    • Write extraction prompt with examples
    • Implement validation logic
    • Add error correction logic
  3. Build extraction pipeline:

    • Multi-step extraction (for complex documents)
    • Confidence scoring
    • Alternative interpretations
  4. Implement robustness:

    • Retry logic
    • Fallback handling
    • Partial extraction support
  5. Test and report:

    • Run on 3-5 real documents per type
    • Report success rate
    • Document failure modes
    • Suggest improvements

Deliverables:

  • 3 JSON Schema definitions
  • 3 extraction prompts (with examples)
  • Python extraction code
  • Validation and error handling
  • Test results with success rates
  • Documentation of limitations

Summary

In this lesson, you’ve learned:

  • How to define extraction schemas and guide models
  • Validation and error correction strategies
  • Handling ambiguous or unclear data
  • Multi-step extraction pipelines
  • Building robust extractors with retry logic
  • Complete end-to-end extraction systems

Next, you’ll learn how to build complete multimodal applications.