Intermediate

Building Multimodal Applications

Lesson 4 of 4 · Estimated time: 50 min

Introduction

You can now work with images, audio, documents, and structured extraction. It's time to put it all together and build real applications that combine these capabilities. A true multimodal application doesn't just process images, documents, or audio in isolation; it orchestrates them together.

This lesson teaches you architecture patterns for multimodal apps, real-world use cases, and optimization strategies.

Key Takeaway: Multimodal applications aren’t just bigger versions of single-modal apps. They require different architecture: preprocessing pipelines, modality-specific processing, and intelligent result synthesis.

Architecture Patterns for Multimodal Apps

Pattern 1: Sequential Processing

Process modalities one after another, then synthesize the combined results. This is the simplest pattern and the easiest to debug:

class SequentialMultimodalProcessor:
    """Process modalities in sequence"""

    def process(self, image: bytes, audio_transcript: str,
               document_text: str) -> dict:
        """Process all modalities sequentially"""

        results = {}

        # Step 1: Vision analysis
        results['vision'] = self._analyze_vision(image)

        # Step 2: Speech/language analysis
        results['speech'] = self._analyze_speech(audio_transcript)

        # Step 3: Document analysis
        results['document'] = self._analyze_document(document_text)

        # Step 4: Synthesize
        results['synthesis'] = self._synthesize(results)

        return results

    def _analyze_vision(self, image: bytes) -> dict:
        """Analyze visual content"""
        return vision_model(image, "Describe the visual content")

    def _analyze_speech(self, transcript: str) -> dict:
        """Analyze spoken content"""
        return language_model(f"Summarize: {transcript}")

    def _analyze_document(self, text: str) -> dict:
        """Analyze document content"""
        return language_model(f"Extract key points from: {text}")

    def _synthesize(self, results: dict) -> dict:
        """Combine insights from all modalities"""

        synthesis_prompt = f"""
Integrate these analyses:

Visual insights: {results['vision']}
Speech insights: {results['speech']}
Document insights: {results['document']}

What is the complete picture across all modalities?
Are there conflicts or complementary information?
"""

        return language_model(synthesis_prompt)
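The pattern can be exercised end to end without any real model. A minimal, self-contained sketch, where `vision_stub` and `language_stub` are hypothetical stand-ins for the `vision_model`/`language_model` helpers used above:

```python
# Minimal sequential pipeline with stubbed model calls.
def vision_stub(image: bytes) -> dict:
    return {"description": "whiteboard with a project timeline"}

def language_stub(prompt: str) -> dict:
    return {"summary": prompt[:40]}

def process_sequentially(image: bytes, transcript: str, doc: str) -> dict:
    results = {}
    results["vision"] = vision_stub(image)                                  # step 1
    results["speech"] = language_stub(f"Summarize: {transcript}")           # step 2
    results["document"] = language_stub(f"Extract key points from: {doc}")  # step 3
    results["synthesis"] = language_stub(                                   # step 4
        f"Integrate: {results['vision']} | {results['speech']} | {results['document']}"
    )
    return results

out = process_sequentially(b"...", "meeting audio transcript", "contract text")
print(sorted(out))  # ['document', 'speech', 'synthesis', 'vision']
```

Swapping the stubs for real API calls turns this sketch into the class above; the ordering and the final synthesis step are unchanged.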

Pattern 2: Parallel Processing

Process modalities concurrently when the analyses are independent of one another:

import concurrent.futures
from typing import Dict, Any

class ParallelMultimodalProcessor:
    """Process modalities in parallel for speed"""

    def process(self, image: bytes, audio: str, document: str,
                executor=None) -> dict:
        """Process modalities in parallel"""

        # Create an executor if the caller didn't supply one, and
        # remember to shut it down so threads don't leak
        own_executor = executor is None
        if own_executor:
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)

        try:
            # Submit all tasks to run concurrently
            vision_future = executor.submit(self._analyze_vision, image)
            speech_future = executor.submit(self._analyze_speech, audio)
            document_future = executor.submit(self._analyze_document, document)

            # result() blocks until each task completes
            results = {
                'vision': vision_future.result(),
                'speech': speech_future.result(),
                'document': document_future.result()
            }
        finally:
            if own_executor:
                executor.shutdown()

        # Synthesize once all modalities are done
        results['synthesis'] = self._synthesize(results)

        return results

    def _analyze_vision(self, image: bytes) -> dict:
        """Vision analysis"""
        return vision_model(image, "What's shown visually?")

    def _analyze_speech(self, transcript: str) -> dict:
        """Speech analysis"""
        return language_model(f"Analyze: {transcript}")

    def _analyze_document(self, text: str) -> dict:
        """Document analysis"""
        return language_model(f"Extract from: {text}")

    def _synthesize(self, results: Dict[str, Any]) -> dict:
        """Synthesize parallel results"""

        synthesis_prompt = f"""
Combine these parallel analyses:

Vision: {results['vision']}
Speech: {results['speech']}
Document: {results['document']}

Provide integrated insights"""

        return language_model(synthesis_prompt)
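The fan-out itself can be shown in a few self-contained lines. The analyzers below are stubs standing in for real model calls, and the executor is used as a context manager so its threads are cleaned up automatically:

```python
import concurrent.futures

# Stub analyzers standing in for real model calls
def analyze_vision(image: bytes) -> dict:
    return {"objects": ["chart"]}

def analyze_speech(transcript: str) -> dict:
    return {"summary": transcript[:20]}

def analyze_document(text: str) -> dict:
    return {"key_points": [text[:20]]}

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    # Submit all three analyses at once
    futures = {
        "vision": pool.submit(analyze_vision, b"..."),
        "speech": pool.submit(analyze_speech, "quarterly review call"),
        "document": pool.submit(analyze_document, "Q3 financial report"),
    }
    # result() blocks until each task has finished
    results = {name: f.result() for name, f in futures.items()}

print(sorted(results))  # ['document', 'speech', 'vision']
```

With real model calls, the wall-clock time approaches that of the slowest single modality instead of the sum of all three.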

Pattern 3: Hierarchical Processing

Process adaptively: use cheap, early results to decide what deserves detailed processing next:

class HierarchicalMultimodalProcessor:
    """Smart routing based on content analysis"""

    def process(self, image: bytes, audio: str, document: str) -> dict:
        """Process hierarchically based on content"""

        # Step 1: Quick classification
        image_type = self._classify_image(image)
        speech_importance = self._assess_speech(audio)

        results = {
            'image_type': image_type,
            'speech_importance': speech_importance
        }

        # Step 2: Route to appropriate processors
        if image_type == 'document':
            # Vision analysis is critical
            results['vision'] = self._detailed_vision_analysis(image)
            results['speech'] = self._quick_speech_summary(audio)

        elif image_type == 'scene':
            # Vision and speech are equally important
            results['vision'] = self._detailed_vision_analysis(image)
            results['speech'] = self._detailed_speech_analysis(audio)

        else:
            # Speech is primary
            results['vision'] = self._quick_image_summary(image)
            results['speech'] = self._detailed_speech_analysis(audio)

        # Step 3: Optional document analysis if needed
        if speech_importance == 'critical':
            results['document'] = self._extract_from_document(document)

        # Step 4: Synthesize
        results['synthesis'] = self._smart_synthesis(results)

        return results

    def _classify_image(self, image: bytes) -> str:
        """Quickly classify what kind of image this is"""
        response = vision_model(image, "Is this a document, scene, or person? One word only.")
        return response.strip().lower()

    def _assess_speech(self, transcript: str) -> str:
        """Assess importance of speech content"""
        response = language_model(f"Is this speech critical? Answer: critical/important/background\n\n{transcript}")
        return response.strip().lower()

    def _detailed_vision_analysis(self, image: bytes) -> dict:
        """Full vision analysis"""
        return vision_model(image, "Comprehensive visual analysis...")

    def _quick_image_summary(self, image: bytes) -> dict:
        """Quick image summary"""
        return vision_model(image, "Brief summary of image in 1-2 sentences")

    def _detailed_speech_analysis(self, transcript: str) -> dict:
        """Full speech analysis"""
        return language_model(f"Detailed analysis: {transcript}")

    def _quick_speech_summary(self, transcript: str) -> dict:
        """Quick speech summary"""
        return language_model(f"Summarize in 1 sentence: {transcript}")

    def _extract_from_document(self, text: str) -> dict:
        """Extract key information from document"""
        return language_model(f"Extract key facts: {text}")

    def _smart_synthesis(self, results: dict) -> dict:
        """Synthesize based on what was processed"""
        # Count only the modality analyses, not the routing metadata
        modalities = [k for k in ('vision', 'speech', 'document') if k in results]
        summary = f"Integrated analysis based on {len(modalities)} modalities"
        return {'summary': summary}
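The routing decision at the heart of this pattern can be isolated as a small, testable function. A sketch mirroring the branches above (the depth labels are illustrative):

```python
# Routing table mirroring the branches in the class above: which depth
# of analysis to run per modality, given the cheap image classification.
def plan_analyses(image_type: str) -> dict:
    if image_type == "document":
        return {"vision": "detailed", "speech": "quick"}
    if image_type == "scene":
        return {"vision": "detailed", "speech": "detailed"}
    # person, or anything unrecognized: speech is primary
    return {"vision": "quick", "speech": "detailed"}

print(plan_analyses("scene"))  # {'vision': 'detailed', 'speech': 'detailed'}
```

Keeping the routing logic separate from the model calls makes it easy to unit-test the plan without spending any tokens.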

Real-World Use Cases

Use Case 1: Automated Expense Reporting

import json

class ExpenseReporter:
    """Multimodal expense extraction and reporting"""

    def process_expense(self, receipt_image: bytes,
                       category_context: str = None) -> dict:
        """Process a receipt and create expense report"""

        # Step 1: Extract receipt data
        extraction = self._extract_receipt(receipt_image)

        if not extraction['success']:
            return {'error': 'Could not extract receipt'}

        receipt_data = extraction['data']

        # Step 2: Categorize expense
        category = self._categorize_expense(receipt_data, category_context)

        # Step 3: Validate against policy
        validation = self._validate_policy(receipt_data, category)

        # Step 4: Format for submission
        report = {
            'date': receipt_data['date'],
            'merchant': receipt_data['store_name'],
            'category': category,
            'amount': receipt_data['total'],
            'items': receipt_data.get('items', []),
            'policy_compliant': validation['compliant'],
            'policy_warnings': validation['warnings']
        }

        return report

    def _extract_receipt(self, image: bytes) -> dict:
        """Extract structured receipt data"""
        prompt = """Extract receipt as JSON:
{"store_name": "...", "date": "YYYY-MM-DD", "total": 0.00, "items": [...]}"""
        response = vision_model(image, prompt)
        try:
            return {'success': True, 'data': json.loads(response)}
        except json.JSONDecodeError:
            return {'success': False}

    def _categorize_expense(self, receipt: dict, context: str) -> str:
        """Categorize the expense"""
        prompt = f"""Categorize this expense:
Store: {receipt['store_name']}
Items: {receipt.get('items', [])}
Context: {context or 'Unknown'}

Category: Travel/Food/Office/Other"""
        # Strip whitespace so the category matches the policy keys exactly
        return language_model(prompt).strip()

    def _validate_policy(self, receipt: dict, category: str) -> dict:
        """Validate against company policy"""
        policy = {
            'Food': {'max': 50, 'items_ok': ['restaurant', 'cafe', 'food']},
            'Travel': {'max': 500, 'items_ok': ['hotel', 'flight', 'taxi']},
            'Office': {'max': 100, 'items_ok': ['supplies', 'equipment']}
        }

        limit = policy.get(category, {}).get('max', 100)
        warnings = []

        if receipt['total'] > limit:
            warnings.append(f"Amount ${receipt['total']} exceeds limit ${limit}")

        return {
            'compliant': len(warnings) == 0,
            'warnings': warnings
        }

Use Case 2: Content Moderation

class ContentModerator:
    """Multimodal content moderation"""

    def moderate(self, image: bytes = None, text: str = None,
                audio_transcript: str = None) -> dict:
        """Moderate content across modalities"""

        moderation_results = {}

        # Analyze each modality
        if image:
            moderation_results['image'] = self._moderate_image(image)

        if text:
            moderation_results['text'] = self._moderate_text(text)

        if audio_transcript:
            moderation_results['audio'] = self._moderate_audio(audio_transcript)

        # Synthesize decision
        decision = self._make_moderation_decision(moderation_results)

        return {
            'per_modality': moderation_results,
            'action': decision['action'],
            'confidence': decision['confidence'],
            'reasons': decision['reasons']
        }

    def _moderate_image(self, image: bytes) -> dict:
        """Check image for policy violations"""
        response = vision_model(image,
            "Does this image violate content policy? Check for: hate, violence, NSFW")
        return {'violated': 'yes' in response.lower()}

    def _moderate_text(self, text: str) -> dict:
        """Check text for policy violations"""
        response = language_model(f"Does this text violate policy: {text}")
        return {'violated': 'yes' in response.lower()}

    def _moderate_audio(self, transcript: str) -> dict:
        """Check audio for policy violations"""
        response = language_model(f"Does this contain policy violations: {transcript}")
        return {'violated': 'yes' in response.lower()}

    def _make_moderation_decision(self, results: dict) -> dict:
        """Decide on moderation action"""

        violations = [r.get('violated') for r in results.values()]
        violation_count = sum(1 for v in violations if v)

        if violation_count == 0:
            return {'action': 'approve', 'confidence': 'high', 'reasons': []}

        elif violation_count == len(violations):
            return {'action': 'reject', 'confidence': 'high', 'reasons': [
                f"{modality} violated policy"
                for modality, result in results.items()
                if result.get('violated')
            ]}

        else:
            return {'action': 'review', 'confidence': 'medium', 'reasons': [
                f"{modality} violated policy"
                for modality, result in results.items()
                if result.get('violated')
            ]}
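The voting logic in `_make_moderation_decision` is another deterministic piece worth testing on its own: approve when nothing is flagged, reject when every modality is flagged, otherwise escalate to human review. A standalone sketch:

```python
# Standalone version of the moderation vote over per-modality results.
def decide(per_modality: dict) -> dict:
    flagged = [m for m, r in per_modality.items() if r.get("violated")]
    reasons = [f"{m} violated policy" for m in flagged]
    if not flagged:
        return {"action": "approve", "confidence": "high", "reasons": []}
    if len(flagged) == len(per_modality):
        # Every modality agrees: reject with high confidence
        return {"action": "reject", "confidence": "high", "reasons": reasons}
    # Mixed signals: send to human review
    return {"action": "review", "confidence": "medium", "reasons": reasons}

print(decide({"image": {"violated": True}, "text": {"violated": False}}))
# {'action': 'review', 'confidence': 'medium', 'reasons': ['image violated policy']}
```

The "review" branch is what makes this multimodal: a single modality's false positive gets a human look instead of an automatic rejection.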

Use Case 3: Document Understanding and Q&A

import json

class DocumentAssistant:
    """Multimodal document understanding"""

    def __init__(self, document_image: bytes, audio_explanation: str = None):
        self.document_image = document_image
        self.audio_explanation = audio_explanation

        # Extract information
        self.document_text = self._extract_text()
        self.visual_structure = self._analyze_structure()

    def answer_question(self, question: str) -> dict:
        """Answer questions about the document"""

        context_parts = [
            f"Document content:\n{self.document_text}",
            f"Visual structure:\n{self.visual_structure}"
        ]

        if self.audio_explanation:
            context_parts.append(f"Additional context:\n{self.audio_explanation}")

        prompt = f"""Based on this document:

{chr(10).join(context_parts)}

Question: {question}

Answer with specific references to the document."""

        answer = language_model(prompt)

        return {
            'question': question,
            'answer': answer,
            'confidence': self._assess_confidence(answer)
        }

    def extract_key_information(self) -> dict:
        """Extract all key information from document"""

        prompt = f"""Extract key information from this document:

{self.document_text}

Visual structure hints: {self.visual_structure}

Return JSON with all important data"""

        try:
            return json.loads(language_model(prompt))
        except json.JSONDecodeError:
            return {}

    def _extract_text(self) -> str:
        """Extract text from document"""
        return vision_model(self.document_image, "Extract all text from this document")

    def _analyze_structure(self) -> str:
        """Analyze document structure"""
        return vision_model(self.document_image,
            "Describe the visual structure: sections, headings, tables, etc")

    def _assess_confidence(self, answer: str) -> str:
        """Assess confidence in answer (crude keyword heuristic)"""
        # Match whole words so e.g. "notable" doesn't trigger "not"
        words = set(answer.lower().split())
        if words & {'not', 'unclear'}:
            return 'low'
        return 'high'

Performance Optimization

Token Budget Management

class TokenBudgetManager:
    """Manage token usage across modalities"""

    # Rough per-item token costs for each modality
    ALLOCATIONS = {
        'image': 500,
        'audio': 300,
        'document': 400,
        'synthesis': 500
    }

    def __init__(self, budget_per_request: int = 4000):
        self.budget = budget_per_request
        self.used = 0

    def allocate_for_modality(self, modality: str,
                              num_items: int = 1) -> int:
        """Allocate tokens for a modality"""

        allocation = self.ALLOCATIONS.get(modality, 400) * num_items

        # Clamp to whatever budget remains
        if self.used + allocation > self.budget:
            allocation = self.remaining_budget()

        self.used += allocation
        return allocation

    def remaining_budget(self) -> int:
        """Get remaining token budget"""
        return max(0, self.budget - self.used)

    def should_process_modality(self, modality: str) -> bool:
        """Check if the remaining budget covers this modality"""
        return self.remaining_budget() >= self.ALLOCATIONS.get(modality, 400)


# Usage
budget = TokenBudgetManager(budget_per_request=4000)

# Vision takes 500 tokens
if budget.should_process_modality('image'):
    vision_tokens = budget.allocate_for_modality('image')
    # Process image

# Audio takes 300 tokens
if budget.should_process_modality('audio'):
    audio_tokens = budget.allocate_for_modality('audio')
    # Process audio

print(f"Remaining budget: {budget.remaining_budget()} tokens")

Caching Multimodal Results

import hashlib

class MultimodalCache:
    """Cache results across modalities"""

    def __init__(self):
        self.cache = {}

    def _hash_input(self, image: bytes = None, text: str = None,
                    audio: str = None) -> str:
        """Create a cache key from the inputs"""

        # MD5 is used as a fast fingerprint here, not for security
        combined = ""
        if image:
            combined += hashlib.md5(image).hexdigest()
        if text:
            combined += text
        if audio:
            combined += audio

        return hashlib.md5(combined.encode()).hexdigest()

    def get(self, image: bytes = None, text: str = None,
            audio: str = None) -> dict | None:
        """Retrieve cached result, or None on a miss"""

        key = self._hash_input(image, text, audio)
        return self.cache.get(key)

    def set(self, result: dict, image: bytes = None,
           text: str = None, audio: str = None):
        """Cache result"""

        key = self._hash_input(image, text, audio)
        self.cache[key] = result
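A sketch of how the cache slots into a pipeline: check before calling the model, store on a miss. `expensive_analysis` is a hypothetical stand-in for a real model call, and the dict-plus-hash below is a compressed version of the class above:

```python
import hashlib

cache = {}
calls = 0  # counts how often the "model" actually runs

def expensive_analysis(image: bytes, text: str) -> dict:
    global calls
    calls += 1  # stand-in for a slow, costly model call
    return {"caption": "a bar chart"}

def cached_analysis(image: bytes, text: str) -> dict:
    # Fingerprint the inputs, same idea as MultimodalCache._hash_input
    combined = hashlib.md5(image).hexdigest() + text
    key = hashlib.md5(combined.encode()).hexdigest()
    if key not in cache:
        cache[key] = expensive_analysis(image, text)
    return cache[key]

first = cached_analysis(b"\x89PNG", "describe this chart")
second = cached_analysis(b"\x89PNG", "describe this chart")  # cache hit
print(calls)  # 1
```

Because the key covers every input modality, changing the prompt, the image, or the audio all produce a fresh entry; identical requests are free after the first.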

Exercise: Build a Multimodal Application

Create a complete multimodal application for one of these:

  1. Receipt & Expense Analyzer:

    • Extract receipt image data
    • Validate against policy
    • Generate expense report
    • Support optional voice notes
  2. Document Interrogator:

    • Load a PDF/images
    • Support optional audio explanation
    • Answer arbitrary questions
    • Extract structured info
  3. Content Reviewer:

    • Accept image, text, optional audio
    • Run modality-specific checks
    • Synthesize moderation decision
    • Report violations

Requirements:

For your chosen use case:

  • Design architecture (sequential/parallel/hierarchical)
  • Implement all modalities
  • Create test cases
  • Optimize for performance
  • Include error handling
  • Generate sample output

Deliverables:

  • Complete application code
  • Architecture diagram
  • Test results (3-5 examples)
  • Performance metrics
  • Documentation

Summary

In this lesson, you’ve learned:

  • Architecture patterns for multimodal processing
  • Sequential, parallel, and hierarchical approaches
  • Real-world use cases and implementations
  • Performance optimization and token budgeting
  • Caching strategies for multimodal results
  • Complete end-to-end multimodal applications

You’ve completed the entire Intermediate phase of Prompt Engineering. You now understand:

  • How to measure and optimize prompts systematically
  • How to design system prompts for complex behaviors
  • How to work with multiple modalities effectively
  • How to build production systems with these capabilities

Next phase: Advanced techniques and specialized applications.