Building Multimodal Applications
Introduction
You can now work with images, audio, documents, and structured extraction. It's time to put it all together and build real applications that combine these capabilities. A true multimodal application doesn't just process images, documents, or audio in isolation: it orchestrates them together.
This lesson covers architecture patterns for multimodal apps, real-world use cases, and optimization strategies.
Key Takeaway: Multimodal applications aren’t just bigger versions of single-modal apps. They require different architecture: preprocessing pipelines, modality-specific processing, and intelligent result synthesis.
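All of the code in this lesson calls two helpers, `vision_model(image, prompt)` and `language_model(prompt)`, that are never defined; they stand in for whatever provider SDK you use. A minimal sketch of compatible stubs, if you want to run the examples locally (the return format here is invented for illustration):

```python
# Hypothetical stand-ins for the model calls used throughout this lesson.
# In a real app, replace the bodies with calls to your provider's SDK.
def vision_model(image: bytes, prompt: str) -> str:
    """Fake vision call: returns a canned description string."""
    return f"[vision:{len(image)} bytes] {prompt}"

def language_model(prompt: str) -> str:
    """Fake language call: echoes the start of the prompt."""
    return f"[llm] {prompt[:60]}"
```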
Architecture Patterns for Multimodal Apps
Pattern 1: Sequential Processing
Process modalities one after another:
class SequentialMultimodalProcessor:
    """Process modalities in sequence"""

    def process(self, image: bytes, audio_transcript: str,
                document_text: str) -> dict:
        """Process all modalities sequentially"""
        results = {}
        # Step 1: Vision analysis
        results['vision'] = self._analyze_vision(image)
        # Step 2: Speech/language analysis
        results['speech'] = self._analyze_speech(audio_transcript)
        # Step 3: Document analysis
        results['document'] = self._analyze_document(document_text)
        # Step 4: Synthesize
        results['synthesis'] = self._synthesize(results)
        return results

    def _analyze_vision(self, image: bytes) -> dict:
        """Analyze visual content"""
        return vision_model(image, "Describe the visual content")

    def _analyze_speech(self, transcript: str) -> dict:
        """Analyze spoken content"""
        return language_model(f"Summarize: {transcript}")

    def _analyze_document(self, text: str) -> dict:
        """Analyze document content"""
        return language_model(f"Extract key points from: {text}")

    def _synthesize(self, results: dict) -> dict:
        """Combine insights from all modalities"""
        synthesis_prompt = f"""
        Integrate these analyses:
        Visual insights: {results['vision']}
        Speech insights: {results['speech']}
        Document insights: {results['document']}

        What is the complete picture across all modalities?
        Are there conflicts or complementary information?
        """
        return language_model(synthesis_prompt)
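Stripped of the model calls, the sequential pattern is just an ordered list of stages folded into a shared results dict; a minimal, self-contained sketch (the stage names and lambda analyzers are illustrative stand-ins):

```python
def run_sequential(stages, inputs):
    """Run (name, fn, input_key) stages in order, collecting results by name."""
    results = {}
    for name, fn, key in stages:
        results[name] = fn(inputs[key])
    return results

stages = [
    ("vision", lambda img: f"described {len(img)} bytes", "image"),
    ("speech", lambda t: f"summary of: {t}", "audio"),
]
out = run_sequential(stages, {"image": b"\x89PNG", "audio": "hello"})
# out == {"vision": "described 4 bytes", "speech": "summary of: hello"}
```

Because each stage sees only its own input, a synthesis stage that needs earlier results would take the accumulated `results` dict as an extra argument, as the class above does.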
Pattern 2: Parallel Processing
Process modalities simultaneously when possible:
import concurrent.futures
from typing import Dict, Any

class ParallelMultimodalProcessor:
    """Process modalities in parallel for speed"""

    def process(self, image: bytes, audio: str, document: str,
                executor=None) -> dict:
        """Process modalities in parallel"""
        if executor is None:
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
        # Submit all tasks
        vision_future = executor.submit(self._analyze_vision, image)
        speech_future = executor.submit(self._analyze_speech, audio)
        document_future = executor.submit(self._analyze_document, document)
        # Wait for all to complete
        results = {
            'vision': vision_future.result(),
            'speech': speech_future.result(),
            'document': document_future.result()
        }
        # Synthesize
        results['synthesis'] = self._synthesize(results)
        return results

    def _analyze_vision(self, image: bytes) -> dict:
        """Vision analysis"""
        return vision_model(image, "What's shown visually?")

    def _analyze_speech(self, transcript: str) -> dict:
        """Speech analysis"""
        return language_model(f"Analyze: {transcript}")

    def _analyze_document(self, text: str) -> dict:
        """Document analysis"""
        return language_model(f"Extract from: {text}")

    def _synthesize(self, results: Dict[str, Any]) -> dict:
        """Synthesize parallel results"""
        synthesis_prompt = f"""
        Combine these parallel analyses:
        Vision: {results['vision']}
        Speech: {results['speech']}
        Document: {results['document']}

        Provide integrated insights"""
        return language_model(synthesis_prompt)
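The payoff of the parallel pattern is that wall-clock time approaches the slowest single stage rather than the sum of all stages; a small timing sketch with sleep-based stubs in place of model calls:

```python
import concurrent.futures
import time

def slow_stage(name: str, delay: float) -> str:
    """Stub analysis that sleeps to simulate a slow model call."""
    time.sleep(delay)
    return f"{name} done"

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(slow_stage, name, 0.2)
               for name in ("vision", "speech", "document")}
    results = {name: fut.result() for name, fut in futures.items()}
elapsed = time.perf_counter() - start
# elapsed is roughly 0.2s (one stage), not 0.6s (three stages back to back)
```

Note the `with` block also shuts the pool down cleanly; the class above leaks worker threads when it creates a new executor on every call, so in production prefer passing in one long-lived executor.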
Pattern 3: Hierarchical Processing
Process strategically, using results to guide what to process next:
class HierarchicalMultimodalProcessor:
    """Smart routing based on content analysis"""

    def process(self, image: bytes, audio: str, document: str) -> dict:
        """Process hierarchically based on content"""
        # Step 1: Quick classification
        image_type = self._classify_image(image)
        speech_importance = self._assess_speech(audio)
        results = {
            'image_type': image_type,
            'speech_importance': speech_importance
        }
        # Step 2: Route to appropriate processors
        if image_type == 'document':
            # Vision analysis is critical
            results['vision'] = self._detailed_vision_analysis(image)
            results['speech'] = self._quick_speech_summary(audio)
        elif image_type == 'scene':
            # Vision and speech are equally important
            results['vision'] = self._detailed_vision_analysis(image)
            results['speech'] = self._detailed_speech_analysis(audio)
        else:
            # Speech is primary
            results['vision'] = self._quick_image_summary(image)
            results['speech'] = self._detailed_speech_analysis(audio)
        # Step 3: Optional document analysis if needed
        if speech_importance == 'critical':
            results['document'] = self._extract_from_document(document)
        # Step 4: Synthesize
        results['synthesis'] = self._smart_synthesis(results)
        return results

    def _classify_image(self, image: bytes) -> str:
        """Quickly classify what kind of image this is"""
        response = vision_model(image, "Is this a document, scene, or person? One word only.")
        return response.strip().lower()

    def _assess_speech(self, transcript: str) -> str:
        """Assess importance of speech content"""
        response = language_model(
            f"Is this speech critical? Answer: critical/important/background\n\n{transcript}")
        return response.strip().lower()

    def _detailed_vision_analysis(self, image: bytes) -> dict:
        """Full vision analysis"""
        return vision_model(image, "Comprehensive visual analysis...")

    def _quick_image_summary(self, image: bytes) -> dict:
        """Quick image summary"""
        return vision_model(image, "Brief summary of image in 1-2 sentences")

    def _detailed_speech_analysis(self, transcript: str) -> dict:
        """Full speech analysis"""
        return language_model(f"Detailed analysis: {transcript}")

    def _quick_speech_summary(self, transcript: str) -> dict:
        """Quick speech summary"""
        return language_model(f"Summarize in 1 sentence: {transcript}")

    def _extract_from_document(self, text: str) -> dict:
        """Extract key information from document"""
        return language_model(f"Extract key facts: {text}")

    def _smart_synthesis(self, results: dict) -> dict:
        """Synthesize based on what was processed"""
        # Placeholder: combine whatever analyses were produced
        summary = f"Integrated analysis based on {len(results)} modalities"
        return {'summary': summary}
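As the number of image types grows, the if/elif routing above gets easier to maintain as a lookup table; a declarative sketch (the route names are illustrative labels, not methods from the class):

```python
# Map an image classification to the analyses worth paying for.
ROUTES = {
    "document": ("detailed_vision", "quick_speech"),
    "scene": ("detailed_vision", "detailed_speech"),
}
DEFAULT_ROUTE = ("quick_vision", "detailed_speech")

def route_for(image_type: str) -> tuple:
    """Pick the processing plan for a classified image."""
    return ROUTES.get(image_type, DEFAULT_ROUTE)
```

For example, `route_for("person")` falls through to the default, speech-first plan.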
Real-World Use Cases
Use Case 1: Automated Expense Reporting
import json

class ExpenseReporter:
    """Multimodal expense extraction and reporting"""

    def process_expense(self, receipt_image: bytes,
                        category_context: str = None) -> dict:
        """Process a receipt and create an expense report"""
        # Step 1: Extract receipt data
        extraction = self._extract_receipt(receipt_image)
        if not extraction['success']:
            return {'error': 'Could not extract receipt'}
        receipt_data = extraction['data']
        # Step 2: Categorize expense
        category = self._categorize_expense(receipt_data, category_context)
        # Step 3: Validate against policy
        validation = self._validate_policy(receipt_data, category)
        # Step 4: Format for submission
        return {
            'date': receipt_data['date'],
            'merchant': receipt_data['store_name'],
            'category': category,
            'amount': receipt_data['total'],
            'items': receipt_data.get('items', []),
            'policy_compliant': validation['compliant'],
            'policy_warnings': validation['warnings']
        }

    def _extract_receipt(self, image: bytes) -> dict:
        """Extract structured receipt data"""
        prompt = """Extract receipt as JSON:
        {"store_name": "...", "date": "YYYY-MM-DD", "total": 0.00, "items": [...]}"""
        response = vision_model(image, prompt)
        try:
            return {'success': True, 'data': json.loads(response)}
        except json.JSONDecodeError:
            return {'success': False}

    def _categorize_expense(self, receipt: dict, context: str) -> str:
        """Categorize the expense"""
        prompt = f"""Categorize this expense:
        Store: {receipt['store_name']}
        Items: {receipt.get('items', [])}
        Context: {context or 'Unknown'}
        Category: Travel/Food/Office/Other"""
        return language_model(prompt).strip()

    def _validate_policy(self, receipt: dict, category: str) -> dict:
        """Validate against company policy"""
        policy = {
            'Food': {'max': 50, 'items_ok': ['restaurant', 'cafe', 'food']},
            'Travel': {'max': 500, 'items_ok': ['hotel', 'flight', 'taxi']},
            'Office': {'max': 100, 'items_ok': ['supplies', 'equipment']}
        }
        limit = policy.get(category, {}).get('max', 100)
        warnings = []
        if receipt['total'] > limit:
            warnings.append(f"Amount ${receipt['total']} exceeds limit ${limit}")
        return {
            'compliant': len(warnings) == 0,
            'warnings': warnings
        }
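The policy check is plain Python and can be unit-tested without any model call; a standalone version of the same limit logic (the limits mirror the illustrative policy above):

```python
# Illustrative per-category spending limits, matching the sketch policy above.
POLICY_LIMITS = {"Food": 50, "Travel": 500, "Office": 100}

def validate_amount(total: float, category: str, default_limit: float = 100) -> dict:
    """Flag amounts over the per-category limit."""
    limit = POLICY_LIMITS.get(category, default_limit)
    warnings = []
    if total > limit:
        warnings.append(f"Amount ${total} exceeds limit ${limit}")
    return {"compliant": not warnings, "warnings": warnings}

validate_amount(42.0, "Food")   # compliant
validate_amount(75.0, "Food")   # one warning: over the $50 food limit
```

Keeping deterministic rules like this out of the prompt layer makes them cheap to run and trivial to test.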
Use Case 2: Content Moderation
class ContentModerator:
    """Multimodal content moderation"""

    def moderate(self, image: bytes = None, text: str = None,
                 audio_transcript: str = None) -> dict:
        """Moderate content across modalities"""
        moderation_results = {}
        # Analyze each modality that was provided
        if image:
            moderation_results['image'] = self._moderate_image(image)
        if text:
            moderation_results['text'] = self._moderate_text(text)
        if audio_transcript:
            moderation_results['audio'] = self._moderate_audio(audio_transcript)
        # Synthesize a decision
        decision = self._make_moderation_decision(moderation_results)
        return {
            'per_modality': moderation_results,
            'action': decision['action'],
            'confidence': decision['confidence'],
            'reasons': decision['reasons']
        }

    def _moderate_image(self, image: bytes) -> dict:
        """Check image for policy violations"""
        response = vision_model(image,
            "Does this image violate content policy? Check for: hate, violence, NSFW")
        return {'violated': 'yes' in response.lower()}

    def _moderate_text(self, text: str) -> dict:
        """Check text for policy violations"""
        response = language_model(f"Does this text violate policy: {text}")
        return {'violated': 'yes' in response.lower()}

    def _moderate_audio(self, transcript: str) -> dict:
        """Check audio for policy violations"""
        response = language_model(f"Does this contain policy violations: {transcript}")
        return {'violated': 'yes' in response.lower()}

    def _make_moderation_decision(self, results: dict) -> dict:
        """Decide on moderation action"""
        reasons = [
            f"{modality} violated policy"
            for modality, result in results.items()
            if result.get('violated')
        ]
        if not reasons:
            return {'action': 'approve', 'confidence': 'high', 'reasons': []}
        if len(reasons) == len(results):
            return {'action': 'reject', 'confidence': 'high', 'reasons': reasons}
        # Mixed signals across modalities: escalate to human review
        return {'action': 'review', 'confidence': 'medium', 'reasons': reasons}
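The decision rule reduces to three cases over the violation flags: approve when none fire, reject when all fire, review otherwise. Isolated for testing:

```python
def decide(per_modality: dict) -> str:
    """Map per-modality violation flags to a moderation action."""
    flags = [r.get("violated", False) for r in per_modality.values()]
    if not any(flags):
        return "approve"
    if all(flags):
        return "reject"
    return "review"  # mixed signals: escalate to a human

decide({"image": {"violated": False}, "text": {"violated": True}})  # "review"
```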
Use Case 3: Document Understanding and Q&A
import json

class DocumentAssistant:
    """Multimodal document understanding"""

    def __init__(self, document_image: bytes, audio_explanation: str = None):
        self.document_image = document_image
        self.audio_explanation = audio_explanation
        # Extract information up front
        self.document_text = self._extract_text()
        self.visual_structure = self._analyze_structure()

    def answer_question(self, question: str) -> dict:
        """Answer questions about the document"""
        context_parts = [
            f"Document content:\n{self.document_text}",
            f"Visual structure:\n{self.visual_structure}"
        ]
        if self.audio_explanation:
            context_parts.append(f"Additional context:\n{self.audio_explanation}")
        prompt = f"""Based on this document:
{chr(10).join(context_parts)}

Question: {question}

Answer with specific references to the document."""
        answer = language_model(prompt)
        return {
            'question': question,
            'answer': answer,
            'confidence': self._assess_confidence(answer)
        }

    def extract_key_information(self) -> dict:
        """Extract all key information from the document"""
        prompt = f"""Extract key information from this document:
{self.document_text}

Visual structure hints: {self.visual_structure}

Return JSON with all important data"""
        return json.loads(language_model(prompt))

    def _extract_text(self) -> str:
        """Extract text from the document image"""
        return vision_model(self.document_image, "Extract all text from this document")

    def _analyze_structure(self) -> str:
        """Analyze document structure"""
        return vision_model(self.document_image,
                            "Describe the visual structure: sections, headings, tables, etc.")

    def _assess_confidence(self, answer: str) -> str:
        """Crude confidence heuristic; replace with a calibrated check in production"""
        if 'not' in answer.lower() or 'unclear' in answer.lower():
            return 'low'
        return 'high'
Performance Optimization
Token Budget Management
class TokenBudgetManager:
    """Manage token usage across modalities"""

    def __init__(self, budget_per_request: int = 4000):
        self.budget = budget_per_request
        self.used = 0

    def allocate_for_modality(self, modality: str,
                              num_items: int = 1) -> int:
        """Allocate tokens for a modality"""
        allocations = {
            'image': 500 * num_items,
            'audio': 300 * num_items,
            'document': 400 * num_items,
            'synthesis': 500
        }
        allocation = allocations.get(modality, 400)
        if self.used + allocation > self.budget:
            # Shrink the allocation to fit what's left of the budget
            allocation = max(0, self.budget - self.used)
        self.used += allocation
        return allocation

    def remaining_budget(self) -> int:
        """Get remaining token budget"""
        return max(0, self.budget - self.used)

    def should_process_modality(self, modality: str) -> bool:
        """Check if enough budget remains for another modality"""
        return self.remaining_budget() > 100
# Usage
budget = TokenBudgetManager(budget_per_request=4000)

# Vision takes 500 tokens
if budget.should_process_modality('image'):
    vision_tokens = budget.allocate_for_modality('image')
    # Process image...

# Audio takes 300 tokens
if budget.should_process_modality('audio'):
    audio_tokens = budget.allocate_for_modality('audio')
    # Process audio...

print(f"Remaining budget: {budget.remaining_budget()} tokens")
Caching Multimodal Results
import hashlib

class MultimodalCache:
    """Cache results across modalities"""

    def __init__(self):
        self.cache = {}

    def _hash_input(self, image: bytes = None, text: str = None,
                    audio: str = None) -> str:
        """Create hash of inputs"""
        combined = ""
        if image:
            combined += hashlib.md5(image).hexdigest()
        if text:
            combined += text
        if audio:
            combined += audio
        return hashlib.md5(combined.encode()).hexdigest()

    def get(self, image: bytes = None, text: str = None,
            audio: str = None) -> dict:
        """Retrieve cached result"""
        key = self._hash_input(image, text, audio)
        return self.cache.get(key)

    def set(self, result: dict, image: bytes = None,
            text: str = None, audio: str = None):
        """Cache result"""
        key = self._hash_input(image, text, audio)
        self.cache[key] = result
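One subtlety with keys built by plain string concatenation, as in `_hash_input` above, is that different input splits can collide: `text="ab", audio="c"` and `text="a", audio="bc"` produce the same key. Length-prefixing each part before hashing removes the ambiguity; a sketch:

```python
import hashlib

def cache_key(*parts: bytes) -> str:
    """Hash each part with a length prefix so ("ab","c") != ("a","bc")."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))  # unambiguous field boundary
        h.update(part)
    return h.hexdigest()

k1 = cache_key(b"ab", b"c")
k2 = cache_key(b"a", b"bc")
# k1 != k2, even though the concatenated bytes are identical
```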
Exercise: Build a Multimodal Application
Create a complete multimodal application for one of these:
- Receipt & Expense Analyzer:
  - Extract receipt image data
  - Validate against policy
  - Generate expense report
  - Support optional voice notes
- Document Interrogator:
  - Load a PDF/images
  - Support optional audio explanation
  - Answer arbitrary questions
  - Extract structured info
- Content Reviewer:
  - Accept image, text, optional audio
  - Run modality-specific checks
  - Synthesize moderation decision
  - Report violations
Requirements:
For your chosen use case:
- Design architecture (sequential/parallel/hierarchical)
- Implement all modalities
- Create test cases
- Optimize for performance
- Include error handling
- Generate sample output
Deliverables:
- Complete application code
- Architecture diagram
- Test results (3-5 examples)
- Performance metrics
- Documentation
Summary
In this lesson, you’ve learned:
- Architecture patterns for multimodal processing
- Sequential, parallel, and hierarchical approaches
- Real-world use cases and implementations
- Performance optimization and token budgeting
- Caching strategies for multimodal results
- Complete end-to-end multimodal applications
You’ve completed the entire Intermediate phase of Prompt Engineering. You now understand:
- How to measure and optimize prompts systematically
- How to design system prompts for complex behaviors
- How to work with multiple modalities effectively
- How to build production systems with these capabilities
Next phase: Advanced techniques and specialized applications.