Building Multimodal Applications
Introduction
You can now work with images, audio, documents, and structured extraction. It's time to put it all together and build real applications that combine these capabilities. A true multimodal application doesn't just process images, documents, or audio in isolation: it orchestrates them together.
This lesson covers architecture patterns for multimodal apps, real-world use cases, and optimization strategies.
Key Takeaway: Multimodal applications aren’t just bigger versions of single-modal apps. They require different architecture: preprocessing pipelines, modality-specific processing, and intelligent result synthesis.
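All of the code in this lesson calls two helpers, `vision_model(image, prompt)` and `language_model(prompt)`, that are never defined; they stand in for whatever provider SDK you use. A minimal sketch of compatible stubs, if you want to run the examples locally (the return format here is invented for illustration):

```python
# Hypothetical stand-ins for the model calls used throughout this lesson.
# In a real app, replace the bodies with calls to your provider's SDK.
def vision_model(image: bytes, prompt: str) -> str:
    """Fake vision call: returns a canned description string."""
    return f"[vision:{len(image)} bytes] {prompt}"

def language_model(prompt: str) -> str:
    """Fake language call: echoes the start of the prompt."""
    return f"[llm] {prompt[:60]}"
```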
Architecture Patterns for Multimodal Apps
Pattern 1: Sequential Processing
Process modalities one after another:
class SequentialMultimodalProcessor:
    """Process modalities in sequence"""

    def process(self, image: bytes, audio_transcript: str,
                document_text: str) -> dict:
        """Process all modalities sequentially"""
        results = {}
        # Step 1: Vision analysis
        results['vision'] = self._analyze_vision(image)
        # Step 2: Speech/language analysis
        results['speech'] = self._analyze_speech(audio_transcript)
        # Step 3: Document analysis
        results['document'] = self._analyze_document(document_text)
        # Step 4: Synthesize
        results['synthesis'] = self._synthesize(results)
        return results

    def _analyze_vision(self, image: bytes) -> dict:
        """Analyze visual content"""
        return vision_model(image, "Describe the visual content")

    def _analyze_speech(self, transcript: str) -> dict:
        """Analyze spoken content"""
        return language_model(f"Summarize: {transcript}")

    def _analyze_document(self, text: str) -> dict:
        """Analyze document content"""
        return language_model(f"Extract key points from: {text}")

    def _synthesize(self, results: dict) -> dict:
        """Combine insights from all modalities"""
        synthesis_prompt = f"""
        Integrate these analyses:
        Visual insights: {results['vision']}
        Speech insights: {results['speech']}
        Document insights: {results['document']}

        What is the complete picture across all modalities?
        Are there conflicts or complementary information?
        """
        return language_model(synthesis_prompt)
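Stripped of the model calls, the sequential pattern is just an ordered list of stages folded into a shared results dict; a minimal, self-contained sketch (the stage names and lambda analyzers are illustrative stand-ins):

```python
def run_sequential(stages, inputs):
    """Run (name, fn, input_key) stages in order, collecting results by name."""
    results = {}
    for name, fn, key in stages:
        results[name] = fn(inputs[key])
    return results

stages = [
    ("vision", lambda img: f"described {len(img)} bytes", "image"),
    ("speech", lambda t: f"summary of: {t}", "audio"),
]
out = run_sequential(stages, {"image": b"\x89PNG", "audio": "hello"})
# out == {"vision": "described 4 bytes", "speech": "summary of: hello"}
```

Because each stage sees only its own input, a synthesis stage that needs earlier results would take the accumulated `results` dict as an extra argument, as the class above does.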
Pattern 2: Parallel Processing
Process modalities simultaneously when possible:
import concurrent.futures
from typing import Dict, Any

class ParallelMultimodalProcessor:
    """Process modalities in parallel for speed"""

    def process(self, image: bytes, audio: str, document: str,
                executor=None) -> dict:
        """Process modalities in parallel"""
        if executor is None:
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
        # Submit all tasks
        vision_future = executor.submit(self._analyze_vision, image)
        speech_future = executor.submit(self._analyze_speech, audio)
        document_future = executor.submit(self._analyze_document, document)
        # Wait for all to complete
        results = {
            'vision': vision_future.result(),
            'speech': speech_future.result(),
            'document': document_future.result()
        }
        # Synthesize
        results['synthesis'] = self._synthesize(results)
        return results

    def _analyze_vision(self, image: bytes) -> dict:
        """Vision analysis"""
        return vision_model(image, "What's shown visually?")

    def _analyze_speech(self, transcript: str) -> dict:
        """Speech analysis"""
        return language_model(f"Analyze: {transcript}")

    def _analyze_document(self, text: str) -> dict:
        """Document analysis"""
        return language_model(f"Extract from: {text}")

    def _synthesize(self, results: Dict[str, Any]) -> dict:
        """Synthesize parallel results"""
        synthesis_prompt = f"""
        Combine these parallel analyses:
        Vision: {results['vision']}
        Speech: {results['speech']}
        Document: {results['document']}

        Provide integrated insights"""
        return language_model(synthesis_prompt)
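The payoff of the parallel pattern is that wall-clock time approaches the slowest single stage rather than the sum of all stages; a small timing sketch with sleep-based stubs in place of model calls:

```python
import concurrent.futures
import time

def slow_stage(name: str, delay: float) -> str:
    """Stub analysis that sleeps to simulate a slow model call."""
    time.sleep(delay)
    return f"{name} done"

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(slow_stage, name, 0.2)
               for name in ("vision", "speech", "document")}
    results = {name: fut.result() for name, fut in futures.items()}
elapsed = time.perf_counter() - start
# elapsed is roughly 0.2s (one stage), not 0.6s (three stages back to back)
```

Note the `with` block also shuts the pool down cleanly; the class above leaks worker threads when it creates a new executor on every call, so in production prefer passing in one long-lived executor.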
Pattern 3: Hierarchical Processing
Process strategically, using results to guide what to process next:
class HierarchicalMultimodalProcessor:
    """Smart routing based on content analysis"""

    def process(self, image: bytes, audio: str, document: str) -> dict:
        """Process hierarchically based on content"""
        # Step 1: Quick classification
        image_type = self._classify_image(image)
        speech_importance = self._assess_speech(audio)
        results = {
            'image_type': image_type,
            'speech_importance': speech_importance
        }
        # Step 2: Route to appropriate processors
        if image_type == 'document':
            # Vision analysis is critical
            results['vision'] = self._detailed_vision_analysis(image)
            results['speech'] = self._quick_speech_summary(audio)
        elif image_type == 'scene':
            # Vision and speech are equally important
            results['vision'] = self._detailed_vision_analysis(image)
            results['speech'] = self._detailed_speech_analysis(audio)
        else:
            # Speech is primary
            results['vision'] = self._quick_image_summary(image)
            results['speech'] = self._detailed_speech_analysis(audio)
        # Step 3: Optional document analysis if needed
        if speech_importance == 'critical':
            results['document'] = self._extract_from_document(document)
        # Step 4: Synthesize
        results['synthesis'] = self._smart_synthesis(results)
        return results

    def _classify_image(self, image: bytes) -> str:
        """Quickly classify what kind of image this is"""
        response = vision_model(image, "Is this a document, scene, or person? One word only.")
        return response.strip().lower()

    def _assess_speech(self, transcript: str) -> str:
        """Assess importance of speech content"""
        response = language_model(
            f"Is this speech critical? Answer: critical/important/background\n\n{transcript}")
        return response.strip().lower()

    def _detailed_vision_analysis(self, image: bytes) -> dict:
        """Full vision analysis"""
        return vision_model(image, "Comprehensive visual analysis...")

    def _quick_image_summary(self, image: bytes) -> dict:
        """Quick image summary"""
        return vision_model(image, "Brief summary of image in 1-2 sentences")

    def _detailed_speech_analysis(self, transcript: str) -> dict:
        """Full speech analysis"""
        return language_model(f"Detailed analysis: {transcript}")

    def _quick_speech_summary(self, transcript: str) -> dict:
        """Quick speech summary"""
        return language_model(f"Summarize in 1 sentence: {transcript}")

    def _extract_from_document(self, text: str) -> dict:
        """Extract key information from document"""
        return language_model(f"Extract key facts: {text}")

    def _smart_synthesis(self, results: dict) -> dict:
        """Synthesize based on what was processed"""
        # Placeholder: combine whatever analyses were produced
        summary = f"Integrated analysis based on {len(results)} modalities"
        return {'summary': summary}
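As the number of image types grows, the if/elif routing above gets easier to maintain as a lookup table; a declarative sketch (the route names are illustrative labels, not methods from the class):

```python
# Map an image classification to the analyses worth paying for.
ROUTES = {
    "document": ("detailed_vision", "quick_speech"),
    "scene": ("detailed_vision", "detailed_speech"),
}
DEFAULT_ROUTE = ("quick_vision", "detailed_speech")

def route_for(image_type: str) -> tuple:
    """Pick the processing plan for a classified image."""
    return ROUTES.get(image_type, DEFAULT_ROUTE)
```

For example, `route_for("person")` falls through to the default, speech-first plan.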
Real-World Use Cases
Use Case 1: Automated Expense Reporting
import json

class ExpenseReporter:
    """Multimodal expense extraction and reporting"""

    def process_expense(self, receipt_image: bytes,
                        category_context: str = None) -> dict:
        """Process a receipt and create an expense report"""
        # Step 1: Extract receipt data
        extraction = self._extract_receipt(receipt_image)
        if not extraction['success']:
            return {'error': 'Could not extract receipt'}
        receipt_data = extraction['data']
        # Step 2: Categorize expense
        category = self._categorize_expense(receipt_data, category_context)
        # Step 3: Validate against policy
        validation = self._validate_policy(receipt_data, category)
        # Step 4: Format for submission
        return {
            'date': receipt_data['date'],
            'merchant': receipt_data['store_name'],
            'category': category,
            'amount': receipt_data['total'],
            'items': receipt_data.get('items', []),
            'policy_compliant': validation['compliant'],
            'policy_warnings': validation['warnings']
        }

    def _extract_receipt(self, image: bytes) -> dict:
        """Extract structured receipt data"""
        prompt = """Extract receipt as JSON:
        {"store_name": "...", "date": "YYYY-MM-DD", "total": 0.00, "items": [...]}"""
        response = vision_model(image, prompt)
        try:
            return {'success': True, 'data': json.loads(response)}
        except json.JSONDecodeError:
            return {'success': False}

    def _categorize_expense(self, receipt: dict, context: str) -> str:
        """Categorize the expense"""
        prompt = f"""Categorize this expense:
        Store: {receipt['store_name']}
        Items: {receipt.get('items', [])}
        Context: {context or 'Unknown'}
        Category: Travel/Food/Office/Other"""
        return language_model(prompt).strip()

    def _validate_policy(self, receipt: dict, category: str) -> dict:
        """Validate against company policy"""
        policy = {
            'Food': {'max': 50, 'items_ok': ['restaurant', 'cafe', 'food']},
            'Travel': {'max': 500, 'items_ok': ['hotel', 'flight', 'taxi']},
            'Office': {'max': 100, 'items_ok': ['supplies', 'equipment']}
        }
        limit = policy.get(category, {}).get('max', 100)
        warnings = []
        if receipt['total'] > limit:
            warnings.append(f"Amount ${receipt['total']} exceeds limit ${limit}")
        return {
            'compliant': len(warnings) == 0,
            'warnings': warnings
        }
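The policy check is plain Python and can be unit-tested without any model call; a standalone version of the same limit logic (the limits mirror the illustrative policy above):

```python
# Illustrative per-category spending limits, matching the sketch policy above.
POLICY_LIMITS = {"Food": 50, "Travel": 500, "Office": 100}

def validate_amount(total: float, category: str, default_limit: float = 100) -> dict:
    """Flag amounts over the per-category limit."""
    limit = POLICY_LIMITS.get(category, default_limit)
    warnings = []
    if total > limit:
        warnings.append(f"Amount ${total} exceeds limit ${limit}")
    return {"compliant": not warnings, "warnings": warnings}

validate_amount(42.0, "Food")   # compliant
validate_amount(75.0, "Food")   # one warning: over the $50 food limit
```

Keeping deterministic rules like this out of the prompt layer makes them cheap to run and trivial to test.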
Use Case 2: Content Moderation
class ContentModerator:
    """Multimodal content moderation"""

    def moderate(self, image: bytes = None, text: str = None,
                 audio_transcript: str = None) -> dict:
        """Moderate content across modalities"""
        moderation_results = {}
        # Analyze each modality that was provided
        if image:
            moderation_results['image'] = self._moderate_image(image)
        if text:
            moderation_results['text'] = self._moderate_text(text)
        if audio_transcript:
            moderation_results['audio'] = self._moderate_audio(audio_transcript)
        # Synthesize a decision
        decision = self._make_moderation_decision(moderation_results)
        return {
            'per_modality': moderation_results,
            'action': decision['action'],
            'confidence': decision['confidence'],
            'reasons': decision['reasons']
        }

    def _moderate_image(self, image: bytes) -> dict:
        """Check image for policy violations"""
        response = vision_model(image,
            "Does this image violate content policy? Check for: hate, violence, NSFW")
        return {'violated': 'yes' in response.lower()}

    def _moderate_text(self, text: str) -> dict:
        """Check text for policy violations"""
        response = language_model(f"Does this text violate policy: {text}")
        return {'violated': 'yes' in response.lower()}

    def _moderate_audio(self, transcript: str) -> dict:
        """Check audio for policy violations"""
        response = language_model(f"Does this contain policy violations: {transcript}")
        return {'violated': 'yes' in response.lower()}

    def _make_moderation_decision(self, results: dict) -> dict:
        """Decide on moderation action"""
        reasons = [
            f"{modality} violated policy"
            for modality, result in results.items()
            if result.get('violated')
        ]
        if not reasons:
            return {'action': 'approve', 'confidence': 'high', 'reasons': []}
        if len(reasons) == len(results):
            return {'action': 'reject', 'confidence': 'high', 'reasons': reasons}
        # Mixed signals across modalities: escalate to human review
        return {'action': 'review', 'confidence': 'medium', 'reasons': reasons}
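The decision rule reduces to three cases over the violation flags: approve when none fire, reject when all fire, review otherwise. Isolated for testing:

```python
def decide(per_modality: dict) -> str:
    """Map per-modality violation flags to a moderation action."""
    flags = [r.get("violated", False) for r in per_modality.values()]
    if not any(flags):
        return "approve"
    if all(flags):
        return "reject"
    return "review"  # mixed signals: escalate to a human

decide({"image": {"violated": False}, "text": {"violated": True}})  # "review"
```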
Use Case 3: Document Understanding and Q&A
import json

class DocumentAssistant:
    """Multimodal document understanding"""

    def __init__(self, document_image: bytes, audio_explanation: str = None):
        self.document_image = document_image
        self.audio_explanation = audio_explanation
        # Extract information up front
        self.document_text = self._extract_text()
        self.visual_structure = self._analyze_structure()

    def answer_question(self, question: str) -> dict:
        """Answer questions about the document"""
        context_parts = [
            f"Document content:\n{self.document_text}",
            f"Visual structure:\n{self.visual_structure}"
        ]
        if self.audio_explanation:
            context_parts.append(f"Additional context:\n{self.audio_explanation}")
        prompt = f"""Based on this document:
{chr(10).join(context_parts)}

Question: {question}

Answer with specific references to the document."""
        answer = language_model(prompt)
        return {
            'question': question,
            'answer': answer,
            'confidence': self._assess_confidence(answer)
        }

    def extract_key_information(self) -> dict:
        """Extract all key information from the document"""
        prompt = f"""Extract key information from this document:
{self.document_text}

Visual structure hints: {self.visual_structure}

Return JSON with all important data"""
        return json.loads(language_model(prompt))

    def _extract_text(self) -> str:
        """Extract text from the document image"""
        return vision_model(self.document_image, "Extract all text from this document")

    def _analyze_structure(self) -> str:
        """Analyze document structure"""
        return vision_model(self.document_image,
                            "Describe the visual structure: sections, headings, tables, etc.")

    def _assess_confidence(self, answer: str) -> str:
        """Crude confidence heuristic; replace with a calibrated check in production"""
        if 'not' in answer.lower() or 'unclear' in answer.lower():
            return 'low'
        return 'high'
Performance Optimization
Token Budget Management
class TokenBudgetManager:
    """Manage token usage across modalities"""

    def __init__(self, budget_per_request: int = 4000):
        self.budget = budget_per_request
        self.used = 0

    def allocate_for_modality(self, modality: str,
                              num_items: int = 1) -> int:
        """Allocate tokens for a modality"""
        allocations = {
            'image': 500 * num_items,
            'audio': 300 * num_items,
            'document': 400 * num_items,
            'synthesis': 500
        }
        allocation = allocations.get(modality, 400)
        if self.used + allocation > self.budget:
            # Shrink the allocation to fit what's left of the budget
            allocation = max(0, self.budget - self.used)
        self.used += allocation
        return allocation

    def remaining_budget(self) -> int:
        """Get remaining token budget"""
        return max(0, self.budget - self.used)

    def should_process_modality(self, modality: str) -> bool:
        """Check if enough budget remains for another modality"""
        return self.remaining_budget() > 100
# Usage
budget = TokenBudgetManager(budget_per_request=4000)

# Vision takes 500 tokens
if budget.should_process_modality('image'):
    vision_tokens = budget.allocate_for_modality('image')
    # Process image...

# Audio takes 300 tokens
if budget.should_process_modality('audio'):
    audio_tokens = budget.allocate_for_modality('audio')
    # Process audio...

print(f"Remaining budget: {budget.remaining_budget()} tokens")
Caching Multimodal Results
import hashlib

class MultimodalCache:
    """Cache results across modalities"""

    def __init__(self):
        self.cache = {}

    def _hash_input(self, image: bytes = None, text: str = None,
                    audio: str = None) -> str:
        """Create hash of inputs"""
        combined = ""
        if image:
            combined += hashlib.md5(image).hexdigest()
        if text:
            combined += text
        if audio:
            combined += audio
        return hashlib.md5(combined.encode()).hexdigest()

    def get(self, image: bytes = None, text: str = None,
            audio: str = None) -> dict:
        """Retrieve cached result"""
        key = self._hash_input(image, text, audio)
        return self.cache.get(key)

    def set(self, result: dict, image: bytes = None,
            text: str = None, audio: str = None):
        """Cache result"""
        key = self._hash_input(image, text, audio)
        self.cache[key] = result
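One subtlety with keys built by plain string concatenation, as in `_hash_input` above, is that different input splits can collide: `text="ab", audio="c"` and `text="a", audio="bc"` produce the same key. Length-prefixing each part before hashing removes the ambiguity; a sketch:

```python
import hashlib

def cache_key(*parts: bytes) -> str:
    """Hash each part with a length prefix so ("ab","c") != ("a","bc")."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))  # unambiguous field boundary
        h.update(part)
    return h.hexdigest()

k1 = cache_key(b"ab", b"c")
k2 = cache_key(b"a", b"bc")
# k1 != k2, even though the concatenated bytes are identical
```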
Exercise: Build a Multimodal Application
Create a complete multimodal application for one of these:
- Receipt & Expense Analyzer:
  - Extract receipt image data
  - Validate against policy
  - Generate expense report
  - Support optional voice notes
- Document Interrogator:
  - Load a PDF/images
  - Support optional audio explanation
  - Answer arbitrary questions
  - Extract structured info
- Content Reviewer:
  - Accept image, text, optional audio
  - Run modality-specific checks
  - Synthesize moderation decision
  - Report violations
Requirements:
For your chosen use case:
- Design architecture (sequential/parallel/hierarchical)
- Implement all modalities
- Create test cases
- Optimize for performance
- Include error handling
- Generate sample output
Deliverables:
- Complete application code
- Architecture diagram
- Test results (3-5 examples)
- Performance metrics
- Documentation
Summary
In this lesson, you’ve learned:
- Architecture patterns for multimodal processing
- Sequential, parallel, and hierarchical approaches
- Real-world use cases and implementations
- Performance optimization and token budgeting
- Caching strategies for multimodal results
- Complete end-to-end multimodal applications
You’ve completed the entire Intermediate phase of Prompt Engineering. You now understand:
- How to measure and optimize prompts systematically
- How to design system prompts for complex behaviors
- How to work with multiple modalities effectively
- How to build production systems with these capabilities
Next phase: Advanced techniques and specialized applications.